Language-supervised vision models have recently attracted great attention in computer vision. A common approach to building such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, under linear representation settings, (i) we initiate the study of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL), including the CLIP loss, and show their connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality, even in the presence of wrongly matched pairs. This characterizes the robustness of MMCL to noisy data. Furthermore, when we have access to additional unpaired data, (iii) we propose a new MMCL loss that incorporates additional unpaired datasets. We show that the algorithm can detect the ground-truth pairs and improve performance by fully exploiting unpaired datasets. The performance of the proposed algorithm is verified by numerical experiments.
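To make the SVD connection concrete, the following is a minimal sketch of the simplest special case only: a linear loss with uniform weights, for which the learned linear representations can be read off from the top singular vectors of a cross-covariance matrix built from matched pairs minus mismatched pairs. The matrix definition used here, the synthetic data model, and all function names (`contrastive_cross_covariance`, `top_r_svd`, `make_data`, `subspace_error`) are illustrative assumptions, not the paper's exact construction, and the sketch does not reproduce the general nonlinear-loss result.

```python
import numpy as np

def contrastive_cross_covariance(X, Y):
    """Assumed form: mean of matched outer products x_i y_i^T minus
    mean of mismatched outer products x_i y_j^T (i != j).
    X has shape (n, d1), Y has shape (n, d2); row i of X is paired with row i of Y."""
    n = X.shape[0]
    matched = X.T @ Y                      # sum_i x_i y_i^T
    # sum_{i != j} x_i y_j^T = (sum_i x_i)(sum_j y_j)^T - sum_i x_i y_i^T
    mismatched = np.outer(X.sum(axis=0), Y.sum(axis=0)) - matched
    return matched / n - mismatched / (n * (n - 1))

def top_r_svd(S, r):
    """Return the top-r left singular vectors, singular values, right singular vectors."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r].T

def make_data(n=2000, d1=20, d2=15, r=3, noise=0.1, seed=0):
    """Hypothetical data model: both modalities share a latent z of dimension r."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((d1, r)))  # orthonormal columns
    B, _ = np.linalg.qr(rng.standard_normal((d2, r)))
    Z = rng.standard_normal((n, r))
    X = Z @ A.T + noise * rng.standard_normal((n, d1))
    Y = Z @ B.T + noise * rng.standard_normal((n, d2))
    return X, Y, A, B

def subspace_error(U, A):
    """Spectral norm of the part of A lying outside span(U); 0 means perfect recovery."""
    return np.linalg.norm(A - U @ (U.T @ A), 2)

X, Y, A, B = make_data()
S = contrastive_cross_covariance(X, Y)
G1, s, G2 = top_r_svd(S, r=3)
err_x = subspace_error(G1, A)   # how well modality-1 features recover span(A)
err_y = subspace_error(G2, B)   # how well modality-2 features recover span(B)
print(f"subspace error (modality 1): {err_x:.3f}")
print(f"subspace error (modality 2): {err_y:.3f}")
```

In this toy setting, the columns of `G1` and `G2` play the role of the two linear encoders. Shuffling a fraction of the rows of `Y` before calling `contrastive_cross_covariance` is one way to probe, informally, the effect of wrongly matched pairs on the recovered subspaces.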