Recently, multimodal contrastive learning (MMCL) approaches, such as CLIP, have achieved remarkable success in learning representations that are robust to distribution shift and generalize to new domains. Despite this empirical success, the mechanism behind learning such generalizable representations is not understood. In this work, we rigorously analyze this problem and uncover two mechanisms behind MMCL's robustness: \emph{intra-class contrasting}, which allows the model to learn features with high variance, and \emph{inter-class feature sharing}, where details annotated in one class help the model learn other classes better. Both mechanisms prevent spurious features that are over-represented in the training data from overshadowing the generalizable core features. This yields superior zero-shot classification accuracy under distribution shift. Furthermore, we theoretically demonstrate the benefits of rich captions for robustness and explore the effect of annotating different types of details in the captions. We validate our theoretical findings through experiments, including a well-designed synthetic experiment and an experiment in which CLIP models are trained on MSCOCO/Conceptual Captions and evaluated on shifted versions of ImageNet.