Revisiting the Role of Language Priors in Vision-Language Models

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores all image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls the amount of linguistic bias in generative VLMs at test time, without retraining or fine-tuning the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, often producing state-of-the-art accuracy.
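
To make the scoring idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the match score is the sum of per-token log-probabilities of a caption given an image, and that a "blind" language prior over the caption is also available. The abstract does not state the form of the post-processing scheme; the correction used here, subtracting `alpha` times the prior log-probability, is one plausible form and is an assumption. The names `sequence_log_prob`, `debiased_score`, `retrieve`, and `alpha` are hypothetical, and the numbers are toy values rather than results.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of a whole caption: the sum of its per-token log-probs."""
    return sum(token_log_probs)


def debiased_score(cond_log_prob: float, prior_log_prob: float, alpha: float) -> float:
    """Assumed correction: log P(text | image) - alpha * log P(text).

    alpha = 0 leaves the raw generative score unchanged; larger alpha
    discounts captions that are likely under the language prior alone.
    """
    return cond_log_prob - alpha * prior_log_prob


def retrieve(cond_log_probs: Sequence[float],
             prior_log_probs: Sequence[float],
             alpha: float) -> int:
    """Return the index of the best-scoring candidate caption for one image."""
    scores = [debiased_score(c, p, alpha)
              for c, p in zip(cond_log_probs, prior_log_probs)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example (made-up numbers): caption 0 is correct but linguistically
# unusual, caption 1 is wrong but fluent.
cond = [-12.0, -10.0]   # log P(caption | image)
prior = [-20.0, -11.0]  # log P(caption), image ignored

raw_choice = retrieve(cond, prior, alpha=0.0)       # raw score prefers the fluent caption
debiased_choice = retrieve(cond, prior, alpha=1.0)  # the correction favors caption 0
```

In this toy setup the raw score selects the fluent but incorrect caption, while the corrected score selects the unusual but correct one, which mirrors the failure mode the abstract attributes to benchmarks with unnatural text distributions. How `alpha` should be chosen is not specified in the abstract and is left open here.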
