The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes for testing VLMs on two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% (4-shot and Chain-of-Thought) on the localization task. Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.