Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that these models can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores all image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls the amount of linguistic bias in generative VLMs at test time, without retraining or fine-tuning the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, often producing state-of-the-art accuracy.
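To make the scoring idea concrete, here is a minimal Python sketch of how a generative VLM's per-token log-probabilities could be turned into a retrieval score and then debiased. The abstract does not give the form of the post-processing, so the sketch assumes one plausible version: subtracting a scaled "blind" language-model log-prior, i.e. scoring with $\log P(t \mid i) - \alpha \log P(t)$. The function names, the parameter `alpha`, and all numbers are hypothetical and chosen purely for illustration; they are not results or code from the paper.

```python
def visual_gpt_score(cond_token_logprobs):
    """log P(text | image): sum of per-token log-probs from a generative VLM.

    The per-token values are assumed to come from some VLM; here they are
    supplied by hand.
    """
    return sum(cond_token_logprobs)


def debiased_score(cond_token_logprobs, prior_token_logprobs, alpha):
    """ASSUMED debiasing form: log P(t | i) - alpha * log P(t).

    prior_token_logprobs are per-token log-probs from a "blind" language
    model that never sees the image. alpha = 0 recovers the plain score.
    """
    return sum(cond_token_logprobs) - alpha * sum(prior_token_logprobs)


def retrieve(candidates, alpha=0.0):
    """Return the index of the highest-scoring caption for one image."""
    scores = [debiased_score(c["cond"], c["prior"], alpha) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (made up): caption 0 is the correct but unusually worded caption,
# caption 1 is an incorrect but very fluent caption.
candidates = [
    {"cond": [-1.0, -2.0, -1.5], "prior": [-3.0, -4.0, -3.0]},
    {"cond": [-1.0, -1.5, -1.5], "prior": [-1.5, -2.0, -1.5]},
]

plain_choice = retrieve(candidates, alpha=0.0)     # language prior dominates
debiased_choice = retrieve(candidates, alpha=1.0)  # prior divided out
print(plain_choice, debiased_choice)
```

In this toy case the plain score prefers the fluent but incorrect caption, while the debiased score prefers the correct one. On a benchmark whose negative captions are unnatural, the prior would instead favor the right answer, which is why `alpha` is treated here as a tunable test-time parameter rather than a fixed constant.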