Vision-language models (VLMs) mainly rely on contrastive training to learn general-purpose representations of images and captions. We focus on the setting where one image is associated with several captions, each containing information shared across all captions as well as information unique to that caption about the scene depicted in the image. In such cases, it is unclear whether contrastive losses are sufficient for learning task-optimal representations that contain all the information provided by the captions, or whether the contrastive learning setup encourages the learning of a simple shortcut that minimizes the contrastive loss. We introduce synthetic shortcuts for vision-language: a training and evaluation framework in which we inject synthetic shortcuts into image-text data. We show that contrastive VLMs trained from scratch or fine-tuned on data containing these synthetic shortcuts mainly learn features that represent the shortcut. Hence, contrastive losses are not sufficient to learn task-optimal representations, i.e., representations that contain all task-relevant information shared between the image and its associated captions. We examine two methods for reducing shortcut learning within our framework: (i) latent target decoding and (ii) implicit feature modification. We show empirically that both methods improve performance on the evaluation task, but only partly reduce shortcut learning when training and evaluating with our shortcut learning framework. These results highlight the difficulty that our shortcut learning framework poses for contrastive vision-language representation learning.
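To make the idea of injected shortcuts concrete, the sketch below attaches a pair-unique identifier token to each image and to all of its captions, so that a contrastive model could match pairs via the identifier alone without using scene content. This is a minimal illustration only; the function and token format (`<sc_i>`) are hypothetical and not the paper's actual implementation.

```python
def inject_shortcut(images, captions_per_image):
    """Attach a pair-unique synthetic shortcut to every image-caption pair.

    images: list of arbitrary image objects (placeholders here).
    captions_per_image: list of caption lists, aligned with `images`.
    Returns (shortcut_images, shortcut_captions) where the i-th image and
    all of its captions carry the same synthetic token `<sc_i>`.
    """
    shortcut_images, shortcut_captions = [], []
    for i, (img, caps) in enumerate(zip(images, captions_per_image)):
        token = f"<sc_{i}>"
        # In the image modality the shortcut could be realized as, e.g., a
        # pixel pattern encoding the identifier; we represent it
        # abstractly as a tag attached to the image.
        shortcut_images.append({"image": img, "shortcut": token})
        # In the text modality the shortcut is appended to every caption
        # of the image, so it is shared among all associated captions.
        shortcut_captions.append([f"{cap} {token}" for cap in caps])
    return shortcut_images, shortcut_captions


images = ["img_a", "img_b"]
captions = [["a dog on grass", "a puppy playing"], ["a red car"]]
sc_imgs, sc_caps = inject_shortcut(images, captions)
```

Because the identifier is both unique per pair and trivially shared across modalities, representing it alone already minimizes the contrastive loss, which is exactly the failure mode the framework is designed to expose.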