Vision-language models (VLMs) rely mainly on contrastive training to learn general-purpose representations of images and captions. We focus on the setting in which one image is associated with several captions, where each caption contains both information shared among all captions and information unique to that caption about the scene depicted in the image. In such cases, it is unclear whether contrastive losses are sufficient for learning task-optimal representations that contain all the information provided by the captions, or whether the contrastive learning setup encourages the learning of a simple shortcut that minimizes the contrastive loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework in which we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned on data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient to learn task-optimal representations, i.e., representations that contain all task-relevant information shared between the image and its associated captions. We examine two methods for reducing shortcut learning in our training and evaluation framework: (i) latent target decoding and (ii) implicit feature modification. We show empirically that both methods improve performance on the evaluation task but only partly reduce shortcut learning when training and evaluating with our framework. These results show how challenging our shortcut learning framework is for contrastive vision-language representation learning.
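To make the shortcut argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE-style objective as the contrastive loss and models a synthetic shortcut as a one-hot identifier shared by an image and its caption; the names `info_nce` and `shortcut_only_embedding` and the temperature value are hypothetical choices made for illustration.

```python
import math


def info_nce(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss for a batch of matched (image, caption) pairs.

    Assumption: row i of image_emb matches row i of text_emb.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)

    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean negative log-probability of the matching (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def shortcut_only_embedding(pair_id, num_pairs):
    """Hypothetical encoder output that stores only the injected shortcut ID."""
    return [1.0 if k == pair_id else 0.0 for k in range(num_pairs)]


num_pairs = 4
# Both encoders ignore the scene content and encode only the shared shortcut.
shortcut_emb = [shortcut_only_embedding(i, num_pairs) for i in range(num_pairs)]
# Uninformative baseline: every image and caption maps to the same vector.
constant_emb = [[1.0, 0.0, 0.0, 0.0] for _ in range(num_pairs)]

print("loss, shortcut-only embeddings:", info_nce(shortcut_emb, shortcut_emb))
print("loss, constant embeddings:     ", info_nce(constant_emb, constant_emb))
```

In this toy setting, embeddings that carry only the shortcut already drive the loss close to zero, while the constant baseline sits at chance level, log(4). Once the shortcut is encoded, the loss provides little remaining signal for encoding any other shared or caption-specific information, which is the failure mode the abstract describes.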