The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA, a diverse dataset of challenging optical illusions and hard-to-interpret scenes that tests the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of GeminiPro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.