As training datasets are increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have come to rely on data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering is also biased and value-laden, encoding specific notions of what counts as "high-quality" data. We audit a standard approach, image-text CLIP-filtering, on CommonPool from the academic benchmark DataComp, analyzing discrepancies in filtering through various annotation techniques across the modalities of image, text, and website source. We find that data relating to several imputed demographic groups, such as LGBTQ+ people, older women, and younger men, are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering also excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically chosen downstream performance metric such as zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.