Multi-modal contrastive learning (MMCL) has recently attracted considerable interest for its superior performance on visual tasks, which it achieves by embedding multi-modal data such as vision-language pairs. However, a theoretical understanding is still lacking of how MMCL extracts useful visual representations from multi-modal pairs and, in particular, of how MMCL outperforms earlier approaches such as self-supervised contrastive learning (SSCL). In this paper, by drawing an intrinsic connection between MMCL and asymmetric matrix factorization, we establish the first generalization guarantees of MMCL for visual downstream tasks. Based on this framework, we further unify MMCL and SSCL by showing that MMCL implicitly performs SSCL with (pseudo) positive pairs induced by text pairs. Through this unified perspective, we characterize the advantage of MMCL by showing that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization. Inspired by this finding, we propose CLIP-guided resampling methods that significantly improve the downstream performance of SSCL on ImageNet by leveraging multi-modal information. Code is available at https://github.com/PKU-ML/CLIP-Help-SimCLR.