"Scale the model, scale the data, scale the GPU farms" is the reigning sentiment in the world of generative AI today. While model scaling has been extensively studied, data scaling and its downstream impact on model performance remain under-explored. This is particularly important in the context of multimodal datasets whose main source is the World Wide Web, condensed and packaged as the Common Crawl dump, which is known to exhibit numerous drawbacks. In this paper, we evaluate the downstream impact of dataset scaling on 14 visio-linguistic models (VLMs) trained on the LAION-400M and LAION-2B datasets by measuring racial and gender bias using the Chicago Face Dataset (CFD) as a probe. Our results show that as the training data increased, the probability of a pre-trained CLIP model misclassifying human images as offensive non-human classes such as chimpanzee, gorilla, and orangutan decreased, but the probability of misclassifying the same images as offensive human classes such as criminal increased. Furthermore, among the 14 Vision Transformer-based VLMs we evaluated, scaling the dataset from 400M to 2B samples increased the probability of the larger ViT-L models predicting an image of a Black man or a Latino man as criminal by 65% and 69%, respectively. Conversely, for the smaller base ViT-B models, the same scaling decreased those probabilities by 20% and 47%, respectively. We ground the model audit results in a qualitative and historical analysis, reflect on our findings and their implications for dataset curation practice, and close with a summary of mitigation mechanisms and ways forward. Content warning: This article contains racially dehumanising and offensive descriptions.
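To make the probing setup concrete, the sketch below performs zero-shot CLIP classification of a single face image against a candidate label set that mixes a neutral human class with the offensive classes named above. This is a minimal sketch, assuming the open_clip library; the checkpoint tag, the prompt template, the illustrative label subset, and the image path cfd_face.jpg are our assumptions for illustration, not the paper's exact protocol.

```python
# Minimal zero-shot probing sketch (assumptions noted above), using open_clip.
import torch
from PIL import Image
import open_clip

# Load a ViT-L/14 CLIP variant pre-trained on LAION-2B; swapping the
# `pretrained` tag for a LAION-400M checkpoint gives the scaling comparison.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

# Candidate classes: a neutral human label plus the offensive human and
# non-human labels probed in the audit (illustrative subset).
labels = ["human being", "criminal", "thief",
          "chimpanzee", "gorilla", "orangutan"]
text = tokenizer([f"a photo of a {label}" for label in labels])

# One CFD face image (hypothetical local path).
image = preprocess(Image.open("cfd_face.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled cosine similarities yields the per-class
    # probabilities whose shifts from 400M to 2B training data are reported.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Running the same image through checkpoints trained on LAION-400M and LAION-2B, and comparing the probability mass assigned to each offensive class, reproduces the shape of the comparison the abstract summarises.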