Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

Source: arXiv. Content warning: This paper discusses societal stereotypes and sexually explicit material that may be disturbing, distressing, and/or offensive to the reader.

As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
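The notion of exclusion amplification described in the abstract can be illustrated with a small sketch. Everything in it is hypothetical: the group labels "A" and "B", the similarity scores, and the fixed threshold of 0.30 are toy values chosen for illustration, and the abstract does not state how the filter cutoff is actually set. The sketch only shows the arithmetic of comparing per-group exclusion rates and group shares before and after filtering.

```python
from collections import Counter

def exclusion_rates(samples, threshold):
    """Fraction of each group's samples whose score falls below the threshold.

    samples: list of (group_label, score) pairs (hypothetical toy format).
    """
    total = Counter()
    excluded = Counter()
    for group, score in samples:
        total[group] += 1
        if score < threshold:
            excluded[group] += 1
    return {g: excluded[g] / total[g] for g in total}

def group_shares(samples):
    """Share of the pool that each group accounts for."""
    counts = Counter(group for group, _ in samples)
    n = sum(counts.values())
    return {g: counts[g] / n for g in counts}

# Toy pool: group "B" is already the smaller group before any filtering.
toy_pool = [
    ("A", 0.35), ("A", 0.31), ("A", 0.28),
    ("A", 0.33), ("A", 0.22), ("A", 0.40),
    ("B", 0.18), ("B", 0.32),
]
THRESHOLD = 0.30  # assumed cutoff, for illustration only

rates = exclusion_rates(toy_pool, THRESHOLD)
shares_before = group_shares(toy_pool)
kept = [(g, s) for g, s in toy_pool if s >= THRESHOLD]
shares_after = group_shares(kept)

for g in sorted(rates):
    print(g, round(rates[g], 3), round(shares_before[g], 3), round(shares_after[g], 3))
```

In this toy pool, group "B" starts with a smaller share and is also excluded at a higher rate, so its share shrinks further after filtering. That pattern is what the abstract calls exclusion amplification.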
