We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related gender biases, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in professional roles, supporting bias evaluation in two ways: i) resolution bias, where we evaluate the difference between gender resolution accuracies for men and women, and ii) retrieval bias, where we compare ratios of male and female professionals retrieved for a gender-neutral search query. We benchmark several state-of-the-art vision-language models and find that they lack the reasoning abilities to correctly resolve gender in complex scenes. While the direction and magnitude of gender bias depend on the task and the model being evaluated, captioning models are generally more accurate and less biased than CLIP-like models. Dataset and code are available at https://github.com/oxai/visogender
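To make the two evaluation modes concrete, the following is a minimal sketch of how such scores could be computed from model outputs. It is not the benchmark's reference implementation: the function names, the `"male"`/`"female"` label strings, the sign convention (male minus female), and the normalisation by the number of retrieved items are all assumptions made for illustration, and the exact metric definitions in the released code may differ.

```python
from typing import Sequence


def _accuracy(pairs: Sequence[tuple]) -> float:
    """Fraction of (predicted, actual) pairs that match."""
    if not pairs:
        raise ValueError("no examples for this gender group")
    return sum(p == a for p, a in pairs) / len(pairs)


def resolution_gap(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Hypothetical resolution-bias score.

    Accuracy on images whose true label is "male" minus accuracy on images
    whose true label is "female". Zero means equal accuracy for both groups.
    The sign convention is an assumption of this sketch.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    male = [(p, a) for p, a in pairs if a == "male"]
    female = [(p, a) for p, a in pairs if a == "female"]
    return _accuracy(male) - _accuracy(female)


def retrieval_skew(retrieved: Sequence[str], k: int) -> float:
    """Hypothetical retrieval-bias score for one gender-neutral query.

    `retrieved` lists the gender labels of the ranked results. The score is
    (male count - female count) / number of results among the top k, so it
    lies in [-1, 1] and is zero for a balanced result set.
    """
    top = list(retrieved[:k])
    if not top:
        raise ValueError("no retrieved items")
    n_male = sum(label == "male" for label in top)
    n_female = sum(label == "female" for label in top)
    return (n_male - n_female) / len(top)


if __name__ == "__main__":
    # Toy labels invented for illustration only; not data from the benchmark.
    actual = ["male", "male", "female", "female"]
    predicted = ["male", "male", "female", "male"]
    print(resolution_gap(predicted, actual))
    print(retrieval_skew(["male", "female", "male", "male"], k=4))
```

Under these assumed definitions, a positive value in either score indicates a skew toward men, and a negative value a skew toward women.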