| Field | Response |
|:---|:---|
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
| Bias Metric (If Measured): | [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/) |
| Which characteristic (feature) show(s) the greatest difference in performance?: | The model shows high variance across many characteristics when used at a high temperature, with the greatest measurable difference seen in categories such as Gender Identity and Race x Gender. |
| Which feature(s) have the worst performance overall? | Age (ambiguous) has both the lowest category accuracy listed (0.75) and a notably negative bias score (–0.56), indicating it is the worst-performing feature overall in this evaluation. |
| Measures taken to mitigate against unwanted bias: | None |
| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | The training datasets contain a large amount of synthetic data generated by LLMs. We manually curated prompts. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Bias Benchmark for Question Answering (BBQ) |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | The datasets, which include video datasets (e.g., YouCook2, VCG Human Dataset) and image captioning datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein).
For instance, these datasets do not contain explicit mentions of demographic classes such as age, gender, or ethnicity in over 80% of samples. In the subset where analysis was performed, certain datasets contain skews in the representation of participants—for example, perceived gender of "female" participants may be significant compared to "male" participants for certain datasets. Separately, individuals aged "40 to 49 years" and “20 to 29 years” are the most frequent among ethnic identifiers. Toxicity analysis was additionally performed on several datasets to identify potential not-safe-for-work samples and risks.
To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, fine-tuning with demographically balanced datasets, and mitigation strategies like counterfactual data augmentation to align with the desired model behavior. This evaluation was conducted on a data subset ranging from 200 to 3,000 samples per dataset; as such, certain limitations may exist in the reliability of the embeddings. A baseline of 200 samples was used across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses.
 |