On the Importance of Contrastive Loss in Multimodal Learning

Recently, contrastive learning approaches (e.g., CLIP (Radford et al., 2021)) have achieved great success in multimodal learning, where the model tries to minimize the distance between the representations of different views (e.g., an image and its caption) of the same data point while keeping the representations of different data points apart. However, from a theoretical perspective, it is unclear how contrastive learning can efficiently learn representations from different views, especially when the data is not isotropic. In this work, we analyze the training dynamics of a simple multimodal contrastive learning model and show that contrastive pairs are important for the model to efficiently balance the learned representations. In particular, we show that the positive pairs drive the model to align the representations at the cost of increasing the condition number, while the negative pairs reduce the condition number, keeping the learned representations balanced.
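As a rough illustration of the objective described above, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in NumPy. It is not the model analyzed in the paper: the function names, the temperature value, and the use of the embedding matrix's singular values as a "condition number" diagnostic are all assumptions made for this example. Row i of each matrix is treated as a positive pair, and every other row in the batch acts as a negative.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of img_emb and row i of txt_emb form a positive pair;
    all other row combinations in the batch are negative pairs.
    The temperature of 0.1 is an illustrative default.
    """
    # Normalize rows so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: same, but normalizing over columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def condition_number(emb):
    # Ratio of largest to smallest singular value of the embedding matrix
    # (a hypothetical diagnostic for how "balanced" the representations are).
    s = np.linalg.svd(emb, compute_uv=False)
    return s[0] / s[-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 16))
    aligned = contrastive_loss(img, img)
    shuffled = contrastive_loss(img, np.roll(img, 1, axis=0))
    print("aligned pairs loss:", aligned)
    print("mismatched pairs loss:", shuffled)
    print("condition number:", condition_number(img))
```

In this sketch, the diagonal of `logits` carries the positive pairs and the off-diagonal entries carry the negatives, so the loss falls when matching views are aligned and rises when a mismatched view is the most similar one.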

