We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in professional roles, supporting bias evaluation in two ways: i) resolution bias, where we evaluate the difference between pronoun resolution accuracies for image subjects with gender presentations perceived as masculine versus feminine by human annotators, and ii) retrieval bias, where we compare the ratios of professionals perceived to have masculine and feminine gender presentations retrieved for a gender-neutral search query. We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes. While the direction and magnitude of gender bias depend on the task and the model being evaluated, captioning models are generally less biased than vision-language encoders. The dataset and code are available at https://github.com/oxai/visogender
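The two bias measures described above can be illustrated with a minimal sketch. It is not the released VisoGender code: the function names `resolution_bias` and `retrieval_share_gap`, the label strings, and the choice to summarise retrieval bias as a difference of shares are all assumptions made for illustration, and the paper's exact metric definitions may differ.

```python
from typing import Sequence


def resolution_bias(correct_masc: Sequence[bool], correct_fem: Sequence[bool]) -> float:
    """Difference in pronoun-resolution accuracy between the two groups.

    Each sequence holds one flag per image: True if the model resolved the
    pronoun correctly. A positive value means higher accuracy on subjects
    perceived as masculine. (Hypothetical helper, not the authors' API.)
    """
    if not correct_masc or not correct_fem:
        raise ValueError("both groups need at least one example")
    acc_masc = sum(correct_masc) / len(correct_masc)
    acc_fem = sum(correct_fem) / len(correct_fem)
    return acc_masc - acc_fem


def retrieval_share_gap(retrieved: Sequence[str]) -> float:
    """Share of masculine-perceived minus share of feminine-perceived results.

    `retrieved` lists the perceived gender presentation ("masculine" or
    "feminine") of each image returned for a gender-neutral query.
    Using a share difference is an assumption of this sketch.
    """
    if not retrieved:
        raise ValueError("no retrieved results")
    n_masc = sum(label == "masculine" for label in retrieved)
    n_fem = sum(label == "feminine" for label in retrieved)
    return (n_masc - n_fem) / len(retrieved)


# Toy inputs, invented purely to show the call pattern.
res_gap = resolution_bias([True, True, False, True], [True, False, False, True])
ret_gap = retrieval_share_gap(["masculine", "masculine", "feminine", "masculine"])
print(res_gap, ret_gap)
```

Under these assumptions, a value of zero from either function indicates parity between the two groups on the toy inputs, and the sign indicates the direction of the skew.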