Cross-modal contrastive learning in vision-language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of mutual information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives; we theoretically prove that the MI involving negatives also matters when noise is prevalent. Guided by a more general form of the lower bound, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which optimizes the MI between an image/text anchor and its negative texts/images more accurately instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and, under theoretical guidance, systematically balances the beneficial and harmful effects of (partial) false negative samples.
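For readers unfamiliar with the baseline objective the abstract refers to, the following is a minimal sketch of the standard image-to-text InfoNCE loss and the MI lower bound it is commonly associated with, log N minus the loss. It is not the method proposed in the paper. The function names `info_nce` and `mi_lower_bound`, the similarity-matrix layout, and the temperature value are assumptions made for illustration.

```python
import math


def info_nce(sim, temperature=0.07):
    """Image-to-text InfoNCE over an N x N similarity matrix.

    sim[i][j] is the similarity between image i and text j. The diagonal
    holds the matched (positive) pairs; every off-diagonal entry is
    treated as a negative, whether or not it is a true negative.
    """
    n = len(sim)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # -log softmax at the positive
    return total / n


def mi_lower_bound(sim, temperature=0.07):
    """InfoNCE estimate of the bound I(anchor; positive) >= log N - loss."""
    return math.log(len(sim)) - info_nce(sim, temperature)


if __name__ == "__main__":
    # Hypothetical batch of 4 pairs with perfectly separated similarities.
    aligned = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print(info_nce(aligned), mi_lower_bound(aligned))
```

Because every off-diagonal entry is pushed down regardless of its semantic relation to the anchor, a caption that partially matches another image in the batch is penalized as if it were unrelated. This is the (partial) false negative effect that the abstract targets.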