Vision-language models, like CLIP (Contrastive Language-Image Pre-training), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to the perpetuation of stereotypes and potential downstream harm. In this work, we conduct a systematic analysis of the social biases present in CLIP, with a focus on the interaction between the image and text modalities. We first propose a taxonomy of social biases, called So-B-IT, which contains 374 words categorized across ten types of bias. Each type can lead to societal harm if associated with a particular demographic group. Using this taxonomy, we examine images retrieved by CLIP from a facial image dataset using each word as part of a prompt. We find that CLIP frequently displays undesirable associations between harmful words and specific demographic groups, such as retrieving mostly pictures of Middle Eastern men when asked to retrieve images of a "terrorist". Finally, we analyze the source of such biases by showing that the same harmful stereotypes we identify are also present in a large image-text dataset used to train CLIP models. Our findings highlight the importance of evaluating and addressing bias in vision-language models, and they suggest the need for transparency and fairness-aware curation of large pre-training datasets.
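The prompt-based retrieval audit described above can be sketched with an open-source CLIP implementation. The following is a minimal illustration, not the paper's exact pipeline: it assumes the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32`, a hypothetical list of face-image paths (`face_dataset_paths`), and a generic prompt template of the form "a photo of a {word}".

```python
# Minimal sketch of a CLIP retrieval audit: rank face images by their
# similarity to a text prompt built from a taxonomy word, then inspect
# the demographics of the top-ranked images. The checkpoint, prompt
# template, and dataset variable below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top_k_images(image_paths, word, k=10):
    """Return the k images most similar to a word-based prompt under CLIP."""
    prompt = f"a photo of a {word}"  # prompt template is an assumption
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text has shape (1, num_images): scaled cosine similarities
    scores = out.logits_per_text[0]
    ranked = scores.argsort(descending=True)[:k]
    return [(image_paths[i], scores[i].item()) for i in ranked]

# The audit then examines the demographic composition of the result, e.g.:
# results = top_k_images(face_dataset_paths, "terrorist")
```

Repeating this retrieval for every word in the taxonomy and tabulating the demographic attributes of the retrieved faces yields the kind of association analysis the abstract describes.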