We explore the extent to which zero-shot vision-language models exhibit gender bias across different vision tasks. Vision models traditionally required task-specific labels to represent concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open vocabulary: they use text embeddings to represent concepts and therefore do not need a fixed set of labels. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection and semantic segmentation? We evaluate different vision-language models on multiple datasets across a set of concepts and find that (i) all models evaluated show distinct performance differences based on the perceived gender of the person co-occurring with a given concept in the image, and aggregating analyses over all concepts can mask these differences; (ii) model calibration (i.e., the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can also contribute to social biases in zero-shot vision settings. Moreover, these biases can propagate further when foundational models like CLIP are used by other models to enable zero-shot capabilities.
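As an illustration of the two quantities the abstract compares across perceived gender, the following is a minimal sketch of a per-concept, per-group accuracy and calibration report. It is not the evaluation code used in this work: the record fields (`concept`, `perceived_gender`, `confidence`, `correct`), the function names, and the choice of expected calibration error with equal-width bins are all assumptions made for the example, and the toy records are invented.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose prediction was correct."""
    return sum(r["correct"] for r in records) / len(records)


def expected_calibration_error(records, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    This is one common calibration summary; it is an assumption here,
    not necessarily the metric used in the work described above.
    """
    bins = defaultdict(list)
    for r in records:
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    total = len(records)
    ece = 0.0
    for members in bins.values():
        mean_conf = sum(m["confidence"] for m in members) / len(members)
        ece += len(members) / total * abs(accuracy(members) - mean_conf)
    return ece


def per_group_report(records):
    """Accuracy and ECE for every (concept, perceived_gender) pair.

    Reporting per concept, rather than pooling all concepts, keeps
    concept-level gaps visible instead of averaging them away.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["concept"], r["perceived_gender"])].append(r)
    return {
        key: {
            "accuracy": accuracy(members),
            "ece": expected_calibration_error(members),
        }
        for key, members in groups.items()
    }


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    toy = [
        {"concept": "umbrella", "perceived_gender": "A", "confidence": 0.9, "correct": True},
        {"concept": "umbrella", "perceived_gender": "A", "confidence": 0.8, "correct": True},
        {"concept": "umbrella", "perceived_gender": "B", "confidence": 0.9, "correct": False},
        {"concept": "umbrella", "perceived_gender": "B", "confidence": 0.8, "correct": True},
    ]
    for key, stats in sorted(per_group_report(toy).items()):
        print(key, stats)
```

Grouping by concept before comparing groups reflects finding (i): a gap that is clear for one concept can shrink or vanish once all concepts are pooled.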