Cross-modal contrastive learning in vision-language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of mutual information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives; we theoretically prove that the MI involving negatives also matters when noise is prevalent. Guided by a more general form of the lower bound, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which optimizes the MI between an image/text anchor and its negative texts/images more accurately instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and, under theoretical guidance, systematically balances the beneficial and harmful effects of (partial) false negative samples.
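For readers unfamiliar with the bound mentioned above, the following is a minimal sketch of the standard image-to-text InfoNCE loss and the MI lower bound it implies, I(X; Y) >= log N - L_InfoNCE for a batch of N pairs. It illustrates only the baseline objective, not the similarity-regulated strategy proposed in this paper; the function names `info_nce` and `mi_lower_bound` and the temperature value are assumptions made for illustration.

```python
import numpy as np


def info_nce(image_emb, text_emb, temperature=0.07):
    """Image-to-text InfoNCE loss over a batch of N paired embeddings.

    Row i of image_emb is the anchor, row i of text_emb is its positive,
    and every other text in the batch is treated as a negative.
    (Hypothetical helper; the temperature default is an assumption.)
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for anchor i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def mi_lower_bound(loss, batch_size):
    """MI lower bound implied by an InfoNCE loss value: log N - loss."""
    return float(np.log(batch_size) - loss)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    texts = images + 0.1 * rng.normal(size=(8, 16))  # noisy positives
    loss = info_nce(images, texts)
    print("InfoNCE loss:", loss)
    print("MI lower bound:", mi_lower_bound(loss, batch_size=8))
```

Because every off-diagonal text is pushed away from the anchor regardless of its actual semantic relation, a text that partially matches the image is penalized as a full negative, which is the (partial) false negative issue the abstract refers to.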