The emergence of Large Vision-Language Models (LVLMs) marks a significant stride towards general artificial intelligence. However, this progress is tempered by model outputs that often reflect biases, a concern that has not yet been extensively investigated. Existing benchmarks do not evaluate biases comprehensively because of their limited data scale, single questioning format, and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark for comprehensively evaluating biases in LVLMs. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social bias (age, disability status, gender, nationality, physical appearance, race, religion, profession, and social economic status) and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use the Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. The questions are divided into open-ended and closed-ended types, so that the sources of bias are fully considered and the biases of LVLMs are evaluated from multiple perspectives. We then conduct extensive evaluations on 15 open-source models and one advanced closed-source model, providing new insights into the biases these models reveal. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.