In self-supervised contrastive learning, a widely adopted objective function is InfoNCE, which compares representations using the heuristic cosine similarity and is closely related to maximizing the Kullback-Leibler (KL)-based mutual information. In this paper, we aim to answer two intriguing questions: (1) Can we go beyond the KL-based objective? (2) Besides the popular cosine similarity, can we design a better similarity function? We answer both questions by generalizing the KL-based mutual information to the $f$-Mutual Information in Contrastive Learning ($f$-MICL) using $f$-divergences. For the first question, we provide a wide range of $f$-MICL objectives that share the nice properties of InfoNCE (e.g., alignment and uniformity) while achieving similar or even superior performance. For the second question, assuming that the joint feature distribution is proportional to the Gaussian kernel, we derive an $f$-Gaussian similarity with better interpretability and empirical performance. Finally, we identify close relationships between the $f$-MICL objective and several popular InfoNCE-based objectives. Using benchmark tasks from both vision and natural language, we empirically evaluate $f$-MICL with different $f$-divergences on various architectures (SimCLR, MoCo, and MoCo v3) and datasets. We observe that $f$-MICL generally outperforms the baselines and that the best-performing $f$-divergence is task- and dataset-dependent.
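As a rough illustration of the generalization from KL-based mutual information to $f$-mutual information, the sketch below computes $D_f(P \,\|\, Q) = \sum_i q_i f(p_i / q_i)$ for discrete distributions and applies it to a joint distribution and the product of its marginals. It relies only on the textbook definition of an $f$-divergence. The function names (`f_divergence`, `f_mutual_information`, `kl_generator`, `pearson_chi2_generator`) and the toy joint table are assumptions made for this example and do not come from the paper. The $f$-MICL objectives themselves, which are estimated from samples with learned encoders, are not reproduced here.

```python
import numpy as np


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i) for discrete P, Q with all q_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))


def f_mutual_information(joint, f):
    """f-divergence between a discrete joint p(x, y) and the product p(x) p(y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal over y, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal over x, shape (1, m)
    return f_divergence(joint.ravel(), (px * py).ravel(), f)


def kl_generator(t):
    """f(t) = t log t, which recovers the usual KL-based mutual information."""
    t = np.asarray(t, dtype=float)
    safe_t = np.where(t > 0, t, 1.0)  # use the convention 0 * log 0 = 0
    return t * np.log(safe_t)


def pearson_chi2_generator(t):
    """f(t) = (t - 1)^2, the generator of the Pearson chi-square divergence."""
    return (np.asarray(t, dtype=float) - 1.0) ** 2


# Hypothetical toy joint distribution over two binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])

mi_kl = f_mutual_information(joint, kl_generator)
mi_chi2 = f_mutual_information(joint, pearson_chi2_generator)
mi_kl_independent = f_mutual_information(independent, kl_generator)

print("KL mutual information:", mi_kl)
print("Chi-square mutual information:", mi_chi2)
print("KL mutual information (independent):", mi_kl_independent)
```

Swapping the generator `f` changes the dependence measure without touching the rest of the code, and every choice returns zero when the two variables are independent, since $f(1) = 0$ for both generators shown.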