As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. Benchmarking five state-of-the-art VLMs, we find concerning levels of bias: the least biased model attains a GRAS Bias Score of only 2 out of 100. Our findings also yield a methodological insight: evaluating bias in VLMs through visual question answering (VQA) requires considering multiple formulations of each question. Our code, data, and evaluation results are publicly available.
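The point about multiple question formulations can be illustrated with a minimal sketch. The abstract does not specify how GRAS aggregates across paraphrases, so the function below is a hypothetical illustration, not the paper's method: given a VLM's answers to several phrasings of the same question about one image, it measures how consistently the model answers, since a model that flips its answer under rephrasing would be mis-scored by any single formulation.

```python
from collections import Counter

def paraphrase_consistency(answers_per_paraphrase):
    """Fraction of paraphrases whose answer matches the majority answer.

    answers_per_paraphrase: one answer string per formulation of the
    same underlying question (hypothetical helper, not the GRAS metric).
    """
    counts = Counter(a.strip().lower() for a in answers_per_paraphrase)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(answers_per_paraphrase)

# Hypothetical VLM answers to three phrasings of the same question
answers = ["doctor", "Doctor", "nurse"]
print(paraphrase_consistency(answers))  # 2 of 3 paraphrases agree
```

A low consistency score signals that conclusions drawn from a single question wording would be unreliable, which is the failure mode the abstract's methodological insight warns about.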