Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation covers all supported inference modes of recent VLMs: image-to-text, text-to-text, text-to-image, and image-to-image. We also propose an automated pipeline for generating high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, in both generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in VLMs. In a comparative analysis of widely used VLMs, we find that different input-output modalities lead to discernible differences in both the magnitude and direction of bias. We further find that VLMs exhibit distinct biases across the attributes we investigate. We hope this work will guide future efforts to improve VLMs so that they learn socially unbiased representations. We will release our data and code.
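To make the evaluation idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual pipeline) of one way to quantify gender bias in image-to-text mode: given VLM-generated captions for demographically concealed prompts about a profession, count gendered terms and compute how far the inferred gender distribution deviates from parity. All names and the scoring rule here are illustrative assumptions.

```python
from collections import Counter

# Hypothetical word lists; a real evaluation would use a richer lexicon
# or a classifier rather than keyword matching.
MALE_TERMS = {"he", "him", "his", "man", "male"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "female"}

def gender_skew(captions):
    """Return a bias score in [-1, 1]: +1 all-male, -1 all-female, 0 balanced.

    Each caption contributes to the male and/or female count if it
    contains any term from the corresponding list.
    """
    counts = Counter()
    for cap in captions:
        tokens = set(cap.lower().split())
        if tokens & MALE_TERMS:
            counts["male"] += 1
        if tokens & FEMALE_TERMS:
            counts["female"] += 1
    total = counts["male"] + counts["female"]
    if total == 0:
        return 0.0  # no gendered language detected
    return (counts["male"] - counts["female"]) / total

# Example: captions a VLM might produce for a gender-concealed
# "surgeon performing an operation" prompt.
captions = [
    "He is operating on a patient.",
    "The surgeon adjusts his gloves.",
    "She reviews the scan before surgery.",
]
print(gender_skew(captions))  # 2 male vs. 1 female -> (2 - 1) / 3
```

A per-profession score like this, aggregated over many generations and compared across modalities (e.g. captions vs. generated images annotated by a classifier), would expose the modality-dependent differences in bias magnitude and direction that the study reports.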