Revisiting the Role of Language Priors in Vision-Language Models

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across eight popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores all image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls the amount of linguistic bias in generative VLMs at test time without retraining or fine-tuning the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, often producing state-of-the-art accuracy.
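To make the scoring idea concrete, the following is a minimal Python sketch, not the authors' implementation. It treats the match score as the log-likelihood of a caption given an image, and it assumes one plausible form of the test-time debiasing: subtracting a scaled text-only log-likelihood, $\log P(t \mid i) - \alpha \log P(t)$. That functional form, the parameter name `alpha`, the helper names, and all numbers are illustrative assumptions; the abstract itself does not specify them.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Sum per-token log-probabilities to get the log-likelihood of a caption."""
    return sum(math.log(p) for p in token_probs)


def debiased_score(log_p_text_given_image: float, log_p_text: float, alpha: float) -> float:
    """Assumed debiasing form (hypothetical): log P(t|i) - alpha * log P(t).

    alpha = 0 leaves the raw generative score untouched;
    larger alpha discounts captions that are likely regardless of the image.
    """
    return log_p_text_given_image - alpha * log_p_text


def retrieve(cond_log_probs: Sequence[float], prior_log_probs: Sequence[float], alpha: float) -> int:
    """Return the index of the caption with the highest (debiased) score."""
    scores = [debiased_score(c, p, alpha) for c, p in zip(cond_log_probs, prior_log_probs)]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up numbers for two candidate captions of one image.
# Caption 0 is fluent but wrong for the image; caption 1 is unusual but correct.
cond_log_probs = [-4.5, -6.0]    # stand-ins for log P(t | i) from a generative VLM
prior_log_probs = [-5.0, -12.0]  # stand-ins for log P(t) from a text-only ("blind") model

raw_choice = retrieve(cond_log_probs, prior_log_probs, alpha=0.0)
debiased_choice = retrieve(cond_log_probs, prior_log_probs, alpha=1.0)
# A "blind" model ranks captions by log P(t) alone and never looks at the image.
blind_choice = max(range(len(prior_log_probs)), key=prior_log_probs.__getitem__)
```

In this toy setup the raw score and the blind model both pick caption 0, while the debiased score with `alpha=1.0` picks caption 1. In practice the conditional and text-only log-likelihoods would come from real models, and `alpha` would need to be chosen per benchmark.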
