Cross-modal contrastive learning in vision-language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of mutual information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives; we prove theoretically that the MI involving negatives also matters when noise is prevalent. Guided by a more general form of the lower bound, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which optimizes the MI between an image/text anchor and its negative texts/images more accurately instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and, under theoretical guidance, systematically balances the beneficial and harmful effects of (partial) false negative samples.
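To make the role of negatives concrete, the following is a minimal sketch and not the method proposed in this paper. `info_nce` is the standard image-to-text InfoNCE loss over an N x N similarity matrix, for which the known bound MI(anchor; positive) >= log N - loss holds. `soft_target_nce` is a hypothetical variant, written only for illustration, that shifts a fraction `alpha` of the target mass from the positive onto the negatives in proportion to a prior similarity estimate, so that likely false negatives are not pushed away as hard. The names `prior_sim`, `alpha`, and `temperature`, and the default values, are assumptions.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def info_nce(sim, temperature=0.07):
    """Standard InfoNCE: sim[i][j] is the similarity of image i and text j,
    and the matched (positive) pairs lie on the diagonal."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logp = log_softmax([s / temperature for s in sim[i]])
        total -= logp[i]  # cross-entropy against a one-hot target
    return total / n


def soft_target_nce(sim, prior_sim, alpha=0.2, temperature=0.07):
    """Hypothetical similarity-regulated variant (an assumption, not the
    paper's loss): negatives receive target mass alpha, split by a softmax
    over prior_sim. Requires at least two samples per batch."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        logp = log_softmax([s / temperature for s in sim[i]])
        negatives = [j for j in range(n) if j != i]
        weights = [math.exp(v) for v in
                   log_softmax([prior_sim[i][j] / temperature for j in negatives])]
        target = [0.0] * n
        target[i] = 1.0 - alpha
        for j, w in zip(negatives, weights):
            target[j] = alpha * w
        total -= sum(t * lp for t, lp in zip(target, logp))
    return total / n


# Toy batch: text 1 is a plausible partial false negative for image 0.
sim = [[0.9, 0.8, 0.1],
       [0.2, 0.9, 0.1],
       [0.1, 0.2, 0.9]]
hard_loss = info_nce(sim)
soft_loss = soft_target_nce(sim, prior_sim=sim, alpha=0.2)
mi_lower_bound = math.log(len(sim)) - hard_loss
```

With `alpha=0` the variant reduces to plain InfoNCE. In this sketch `prior_sim` is simply the current similarity matrix; a progressively refined estimate, as the abstract describes, would replace it.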