Vision-Language Models (VLMs) such as GPT-4V have recently made remarkable strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We find that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks. Moreover, a detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend the multiple, confounding abstract patterns in RPM examples.