Large vision-language models (VLMs) are being widely adopted in industry and academia. In this work, we build a unified framework to systematically evaluate gender-profession bias in VLMs. Our evaluation covers all inference modes supported by recent VLMs: image-to-text, text-to-text, text-to-image, and image-to-image. To benchmark gender bias, we construct a synthetic, high-quality dataset of text and images that blurs gender distinctions across professional actions. Benchmarking recent VLMs with this framework, we observe that different input-output modalities yield distinct bias magnitudes and directions. We hope our work will guide future progress toward VLMs that learn socially unbiased representations. We will release our data and code.