Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.