We study the effectiveness of data balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e., biases in first- and second-order statistics, respectively) in multimodal data. We use M4 to conduct an in-depth analysis that takes into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Moreover, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g., applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet zero-shot classification from 77% to 77.5%. Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.
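To make the distinction between first- and second-order biases concrete, the following is a minimal sketch and not the M4 algorithm itself. It assumes a binary sensitive attribute `s`, a binary co-occurring attribute `y`, and per-example weights `w`; the function names and the target proportion of 0.5 are hypothetical choices made for illustration. Representation bias is measured as the gap between the weighted mean of `s` and the target, and association bias as the weighted covariance between `s` and `y`. The reweighting shown is plain cell-wise importance weighting toward an independent joint distribution, swapped in here only because it is simple to verify.

```python
import numpy as np

def weighted_mean(x, w):
    """Weighted average of x under non-negative weights w."""
    return float(np.sum(w * x) / np.sum(w))

def representation_bias(s, w, target=0.5):
    """First-order bias: gap between the weighted share of s and a target share."""
    return abs(weighted_mean(s, w) - target)

def association_bias(s, y, w):
    """Second-order bias: absolute weighted covariance between s and y."""
    ms, my = weighted_mean(s, w), weighted_mean(y, w)
    return abs(weighted_mean((s - ms) * (y - my), w))

def independence_weights(s, y, target=0.5):
    """Illustrative reweighting (an assumption, not M4): weight each (s, y)
    cell by q(s, y) / p(s, y), where q makes s and y independent with
    P(s = 1) = target and keeps the empirical marginal of y."""
    s = np.asarray(s, dtype=int)
    y = np.asarray(y, dtype=int)
    p_y1 = y.mean()
    w = np.zeros(len(s), dtype=float)
    for sv in (0, 1):
        for yv in (0, 1):
            cell = (s == sv) & (y == yv)
            p_cell = cell.mean()
            if p_cell == 0:
                raise ValueError("every (s, y) cell needs at least one example")
            q_s = target if sv == 1 else 1.0 - target
            q_y = p_y1 if yv == 1 else 1.0 - p_y1
            w[cell] = q_s * q_y / p_cell
    return w

# Toy data: s is over-represented (60%) and positively associated with y.
s = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])
uniform = np.ones(len(s))
balanced = independence_weights(s, y)

print("before:", representation_bias(s, uniform), association_bias(s, y, uniform))
print("after: ", representation_bias(s, balanced), association_bias(s, y, balanced))
```

On this toy data both quantities are nonzero under uniform weights and vanish, up to floating-point error, under the reweighted distribution. Real multimodal data has many attributes and sparse cells, which is the setting a dedicated method would need to handle.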