Vision-language models (VLMs) are widely used in many downstream tasks, including those that require assessing individuals who appear in images. While VLMs perform well in simple single-person scenarios, real-world applications often involve complex scenes in which people of different genders perform different activities. We show that in such cases, VLMs are biased towards identifying the individual of the expected gender (according to gender stereotypes ingrained in the model or other forms of sample selection bias) as the performer of the activity. We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias and analyze how this bias is internalized in VLMs. To assess this bias, we introduce the GAB dataset, which contains approximately 5500 AI-generated images representing a variety of activities and addresses the scarcity of real-world images for some scenarios. For quality control, the generated images are evaluated for diversity, quality, and realism. We test 12 well-known pre-trained VLMs on this dataset in text-to-image and image-to-text retrieval to measure the effect of this bias on their predictions. Additionally, we carry out supplementary experiments to quantify the bias in VLMs' text encoders and to evaluate VLMs' ability to recognize activities. Our experiments indicate that VLMs experience an average performance decline of about 13.2% when confronted with the gender-activity binding bias.
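As a rough illustration of the image-to-text retrieval setting described above, the sketch below scores one way a binding accuracy and its drop could be computed. It is not the paper's evaluation code. It assumes that each image is paired with two candidate captions, one naming the actual performer's gender and one with the gender swapped, and that a VLM has already produced an image-caption similarity score for each. The names `Trial`, `binding_accuracy`, and `accuracy_drop`, and all of the scores, are hypothetical. The drop is reported here as an absolute difference in accuracy, which may differ from how the 13.2% figure is defined.

```python
from typing import Sequence, Tuple

# Hypothetical trial format (an assumption, not the paper's):
# (similarity to the correct caption, similarity to the gender-swapped caption)
Trial = Tuple[float, float]


def binding_accuracy(trials: Sequence[Trial]) -> float:
    """Fraction of trials in which the correct caption outscores the swapped one."""
    if not trials:
        raise ValueError("need at least one trial")
    hits = sum(1 for correct, swapped in trials if correct > swapped)
    return hits / len(trials)


def accuracy_drop(expected: Sequence[Trial], unexpected: Sequence[Trial]) -> float:
    """Absolute accuracy difference between stereotype-consistent scenes
    (performer has the expected gender) and stereotype-inconsistent scenes."""
    return binding_accuracy(expected) - binding_accuracy(unexpected)


# Made-up similarity scores, for illustration only.
stereotypical = [(0.31, 0.22), (0.29, 0.25), (0.33, 0.30), (0.28, 0.27)]
counter_stereotypical = [(0.26, 0.24), (0.23, 0.27), (0.30, 0.21), (0.22, 0.25)]

drop = accuracy_drop(stereotypical, counter_stereotypical)
print(f"accuracy drop: {drop:.2f}")
```

With real data, the similarity scores would come from a VLM's image and text encoders. A text-to-image variant could be scored the same way by pairing each caption with a correct image and a gender-swapped image.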