Severe data imbalance naturally exists in web-scale vision-language datasets. Despite this, we find that CLIP pre-trained on such data is notably more robust to the imbalance than supervised learning, and is highly effective at learning generalizable representations. To investigate the reasons behind this finding, we conduct controlled experiments on various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem in which only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code will be available at: https://github.com/CVMI-Lab/clip-beyond-tail.
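To make the "dynamic classification" view concrete, the following is a minimal sketch of a classification analogue, not the released implementation. It assumes a supervised setting with a fixed class vocabulary and restricts the softmax to the classes that occur in the current batch, so that absent (often dominant) classes contribute nothing to the denominator. The function names `full_softmax_loss` and `batch_subset_loss`, the vocabulary size, and the random logits are hypothetical and chosen only for illustration; CLIP itself uses a contrastive loss over in-batch image-text pairs rather than explicit class labels.

```python
import numpy as np


def full_softmax_loss(logits, labels):
    """Mean cross-entropy where every class in the vocabulary competes."""
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def batch_subset_loss(logits, labels):
    """Mean cross-entropy restricted to the classes present in this batch.

    Illustrative assumption: the candidate set changes from batch to batch,
    which mimics a classification problem whose class subset is dynamic.
    """
    present = np.unique(labels)                    # sorted class ids in the batch
    sub_logits = logits[:, present]                # keep only those columns
    sub_labels = np.searchsorted(present, labels)  # remap ids to 0..len(present)-1
    return full_softmax_loss(sub_logits, sub_labels)


# Toy data: 8 samples drawn from a hypothetical 1000-class vocabulary.
rng = np.random.default_rng(0)
num_classes, batch_size = 1000, 8
logits = rng.normal(size=(batch_size, num_classes))
labels = rng.integers(0, num_classes, size=batch_size)

loss_full = full_softmax_loss(logits, labels)
loss_subset = batch_subset_loss(logits, labels)
```

Because the restricted denominator sums over fewer classes, `loss_subset` can never exceed `loss_full` for the same logits; the two coincide when every class in the vocabulary appears in the batch.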