Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores all image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, often producing state-of-the-art accuracy.
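To make the scoring idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that per-token log-probabilities have already been obtained from some generative VLM (conditioned on the image) and from a text-only "blind" pass. The specific correction used here, subtracting `alpha` times the text-only log-probability from the image-conditioned one, is an assumed form of the debiasing step, since the abstract does not spell it out. All function names and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of a caption: the sum of its per-token log-probabilities."""
    return sum(token_log_probs)


def debiased_score(cond_log_prob: float, prior_log_prob: float, alpha: float) -> float:
    """Assumed debiasing form: log P(text | image) - alpha * log P(text).

    alpha = 0 leaves the image-conditioned score unchanged;
    larger alpha discounts captions that are likely on language grounds alone.
    """
    return cond_log_prob - alpha * prior_log_prob


def retrieve(
    cond_log_probs: Sequence[float],
    prior_log_probs: Sequence[float],
    alpha: float = 0.0,
) -> int:
    """Return the index of the candidate caption with the highest score."""
    scores = [
        debiased_score(c, p, alpha)
        for c, p in zip(cond_log_probs, prior_log_probs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy numbers (made up for illustration): caption 0 matches the image but is
# linguistically unusual; caption 1 is fluent but does not match the image.
cond = [-12.0, -11.0]   # log P(caption | image)
prior = [-20.0, -14.0]  # log P(caption) from a text-only pass

best_without_debiasing = retrieve(cond, prior, alpha=0.0)
best_with_debiasing = retrieve(cond, prior, alpha=1.0)
```

In this toy setting, the uncorrected score prefers the fluent caption, while the corrected score prefers the caption that the image actually supports. In practice `alpha` would have to be chosen per benchmark, for example on held-out data.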