Revisiting the Role of Language Priors in Vision-Language Models

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores all image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, often producing state-of-the-art accuracy.
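
To make the scoring idea concrete, here is a minimal sketch rather than the authors' implementation. It assumes the match score is the log-probability of a caption given the image, summed over tokens, and that debiasing subtracts a weighted log-probability of the caption under a "blind" language model. The abstract does not state this exact form, so the formula, the weight `alpha`, the function names, and all probabilities below are illustrative assumptions.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a caption: the sum of its per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def debiased_score(log_p_text_given_image, log_p_text, alpha):
    """Assumed debiasing form: log P(t|i) - alpha * log P(t).

    alpha = 0 keeps the raw generative score; larger alpha discounts
    captions that are likely under the language prior alone.
    """
    return log_p_text_given_image - alpha * log_p_text


def retrieve(candidates, alpha=0.0):
    """Return the caption with the highest (optionally debiased) score.

    candidates maps caption -> (token probs given the image,
                                token probs from a blind language model).
    """
    def score(item):
        conditional, blind = item[1]
        return debiased_score(
            sequence_log_prob(conditional), sequence_log_prob(blind), alpha
        )

    return max(candidates.items(), key=score)[0]


# Hypothetical numbers: the image shows grass inside a mug. The correct
# caption is linguistically unusual; the wrong one is a common phrase.
candidates = {
    "grass in the mug": ([0.2, 0.5, 0.5], [0.05, 0.4, 0.5]),
    "mug in the grass": ([0.3, 0.5, 0.4], [0.3, 0.5, 0.4]),
}

raw_choice = retrieve(candidates, alpha=0.0)       # language prior dominates
debiased_choice = retrieve(candidates, alpha=1.0)  # prior divided out
```

With these toy values the raw score prefers the common but wrong caption, while `alpha=1.0` recovers the correct one. In this sketch `alpha` is a test-time knob, so nothing about the model itself needs to be retrained.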
