The emergence of Large Vision-Language Models (LVLMs) marks a significant stride towards general artificial intelligence. However, these advances are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks do not evaluate bias comprehensively because of their limited data scale, single question format, and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmark designed to evaluate biases in LVLMs. VLBiasBench features a dataset that covers nine distinct categories of social bias (age, disability status, gender, nationality, physical appearance, race, religion, profession, and socioeconomic status) as well as two intersectional bias categories: race × gender and race × socioeconomic status. To build a large-scale dataset, we use the Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with various questions to create 128,342 samples. These questions are divided into open-ended and close-ended types, ensuring thorough consideration of bias sources and a comprehensive evaluation of LVLM biases from multiple perspectives. We conduct extensive evaluations on 15 open-source models as well as two advanced closed-source models, yielding new insights into the biases present in these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.