Vision-Language Models (VLMs) have recently made substantial strides on diverse vision-language tasks. We investigate vision-based deductive reasoning, a more sophisticated but less explored capability, and uncover previously unexposed blind spots in current state-of-the-art (SOTA) VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' ability to perform multi-hop relational and deductive reasoning relying solely on visual clues. We conduct comprehensive evaluations of several popular VLMs using standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) prompting on three diverse datasets: the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that, despite the impressive capabilities of LLMs in text-based reasoning, VLMs remain far from comparable proficiency in visual deductive reasoning. We find that certain standard strategies effective for LLMs do not translate seamlessly to the challenges of visual reasoning tasks. A detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.
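Of the standard strategies named above, self-consistency is the most mechanical: sample several independent answers for the same puzzle and take a majority vote. The sketch below illustrates that voting step under stated assumptions; `ask_model` is a hypothetical callable standing in for one sampled VLM generation and is not part of the original work.

```python
from collections import Counter

def self_consistency_answer(ask_model, prompt, n_samples=5):
    """Majority-vote over several sampled answers (self-consistency).

    `ask_model` is a hypothetical stand-in: a callable that takes a
    prompt and returns the model's final answer string for one
    independently sampled generation.
    """
    # Draw n_samples independent answers for the same prompt.
    answers = [ask_model(prompt) for _ in range(n_samples)]
    # Return the most frequent final answer.
    return Counter(answers).most_common(1)[0][0]
```

For an RPM item, `prompt` would describe the matrix and the candidate completions; the vote is taken over the final choice letter, not the intermediate reasoning.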