BBQ-V: Benchmarking Visual Stereotype Bias in Large Multimodal Models

Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow more influential, mitigating their biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets for evaluating stereotype bias in LMMs often lack diversity, rely on synthetic images, and mostly contain single-actor images, leaving a gap in bias evaluation for real-world visual contexts. To address this gap, we introduce BBQ-Vision (BBQ-V), the most comprehensive framework for assessing stereotype biases across nine diverse categories and 50 sub-categories using real, multi-actor images. The BBQ-V benchmark contains 14,144 image-question pairs and evaluates LMMs through carefully curated, visually grounded scenarios that challenge them to reason accurately about visual stereotypes. It features real-world visual samples, image variations, and open-ended question formats, enabling a precise and nuanced assessment of a model's reasoning capabilities across varying levels of difficulty. Through rigorous testing of 19 state-of-the-art LMMs, both open-source (general-purpose and reasoning) and closed-source, we show that these top-performing models are often biased on several social stereotypes, and that thinking models introduce more bias in their reasoning chains. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our dataset and evaluation code are publicly available.
