Contrastive learning is a powerful self-supervised learning method, but we have a limited theoretical understanding of how and why it works. In this paper, we prove that contrastive learning with the standard InfoNCE loss is equivalent to spectral clustering on the similarity graph. Using this equivalence as a building block, we extend our analysis to the CLIP model and rigorously characterize how similar multi-modal objects are embedded together. Motivated by our theoretical insights, we introduce the Kernel-InfoNCE loss, which incorporates mixtures of kernel functions and outperforms the standard Gaussian kernel on several vision datasets. The code is available at https://github.com/yifanzhang-pro/Kernel-InfoNCE.
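To make the kernel view of InfoNCE concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes L2-normalized embeddings and a one-directional loss in which the negatives for each anchor are the other samples' second views. Under those assumptions, a Gaussian kernel exp(-||x - y||^2 / (2 tau)) differs from exp(x·y / tau) only by a constant factor that cancels in the ratio, so the kernel form reproduces the usual temperature-scaled InfoNCE. The function names, the Laplacian component, and the mixing `weight` are hypothetical choices made for illustration; the mixtures actually studied in the paper may differ.

```python
import numpy as np

def l2_normalize(x):
    # Project each row onto the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_sq_dists(a, b):
    # Squared Euclidean distance between every row of a and every row of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian_kernel(a, b, tau=0.5):
    return np.exp(-pairwise_sq_dists(a, b) / (2.0 * tau))

def laplacian_kernel(a, b, tau=0.5):
    return np.exp(-np.sqrt(pairwise_sq_dists(a, b)) / tau)

def mixture_kernel(a, b, tau=0.5, weight=0.5):
    # Hypothetical convex combination of two kernels (illustrative only).
    return (weight * gaussian_kernel(a, b, tau)
            + (1.0 - weight) * laplacian_kernel(a, b, tau))

def kernel_info_nce(z1, z2, kernel):
    # k[i, j] is the kernel similarity between view 1 of sample i
    # and view 2 of sample j; the diagonal holds the positive pairs.
    k = kernel(z1, z2)
    return float(np.mean(-np.log(np.diag(k) / k.sum(axis=1))))

def standard_info_nce(z1, z2, tau=0.5):
    # Temperature-scaled dot-product InfoNCE, computed with a stable log-softmax.
    logits = z1 @ z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy data: 8 samples, each with two noisy views in 16 dimensions.
rng = np.random.default_rng(0)
z1 = l2_normalize(rng.normal(size=(8, 16)))
z2 = l2_normalize(z1 + 0.1 * rng.normal(size=(8, 16)))

print("Gaussian-kernel InfoNCE:", kernel_info_nce(z1, z2, gaussian_kernel))
print("Standard InfoNCE:       ", standard_info_nce(z1, z2))
print("Mixture-kernel InfoNCE: ", kernel_info_nce(z1, z2, mixture_kernel))
```

The first two printed values should agree up to floating-point error, which is the Gaussian-kernel reading of InfoNCE in this simplified setting. Swapping in `mixture_kernel` changes only the similarity function, not the structure of the loss.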