The growing tendency to collect large, uncurated datasets for training vision-and-language models has raised concerns about fair representation. Even small, manually annotated datasets such as MSCOCO are known to be affected by societal bias. This problem, far from being solved, may be getting worse as data is crawled from the Internet with little control. In addition, the lack of tools for analyzing societal bias in large image collections makes the problem extremely challenging to address. Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes. Our second contribution is a comprehensive analysis of the annotations, focusing on how different demographic groups are represented. Our last contribution is an evaluation of three prevailing vision-and-language tasks (image captioning, text-image CLIP embeddings, and text-to-image generation), showing that societal bias is a persistent problem in all of them.