Following recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision-language models (LVLMs) has emerged. Such models can take images as input in addition to text and perform tasks such as visual question answering, image captioning, and story generation. Here, we examine potential gender and racial biases in such systems, based on the perceived characteristics of the people in the input images. To accomplish this, we present a new dataset, PAIRS (PArallel Images for eveRyday Scenarios). The PAIRS dataset contains sets of AI-generated images of people, such that the images within a set are highly similar in terms of background and visual content but differ along the dimensions of gender (man, woman) and race (Black, white). By querying the LVLMs with such images, we observe significant differences in the responses according to the perceived gender or race of the person depicted.