Vision-Language Models (VLMs) such as GPT-4V have recently made remarkable progress on diverse vision-language tasks. We examine vision-based deductive reasoning, a more sophisticated but less explored area, and uncover previously unexposed blind spots in current state-of-the-art (SOTA) VLMs. Specifically, we use Raven's Progressive Matrices (RPMs) to assess VLMs' ability to perform multi-hop relational and deductive reasoning relying solely on visual clues. We comprehensively evaluate several popular VLMs using standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) prompting on three diverse datasets: the Mensa IQ test, IntelligenceTest, and RAVEN. The results show that, despite the impressive capabilities of large language models (LLMs) in text-based reasoning, we remain far from achieving comparable proficiency in visual deductive reasoning. We find that certain standard strategies that are effective for LLMs do not transfer seamlessly to visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle with these tasks mainly because they are unable to perceive and comprehend the multiple, confounding abstract patterns in RPM examples.