Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it remains unclear how they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns. With this dataset, we evaluate large multimodal models on abstract patterns built from fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they do not generalize well to simple abstract patterns. Notably, even GPT-4V cannot solve more than half of the puzzles. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future. Our data and code will be released publicly at https://github.com/declare-lab/LLM-PuzzleTest.
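The progressive guidance described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the `Puzzle` class, the `build_prompt` and `progressive_prompts` helpers, the prompt wording, and the example puzzle are all hypothetical, chosen only to show how ground truth explanations for the three stages might be revealed one at a time.

```python
from dataclasses import dataclass

# Stage order follows the abstract; the identifiers themselves are assumptions.
STAGES = ("visual perception", "inductive reasoning", "deductive reasoning")


@dataclass
class Puzzle:
    """Hypothetical container for one puzzle and its ground truth explanations."""
    question: str
    explanations: dict  # maps each stage name to its explanation text


def build_prompt(puzzle, num_guided_stages):
    """Build a prompt that reveals explanations for the first `num_guided_stages` stages."""
    if not 0 <= num_guided_stages <= len(STAGES):
        raise ValueError("num_guided_stages must be between 0 and 3")
    lines = [f"Question: {puzzle.question}"]
    for stage in STAGES[:num_guided_stages]:
        lines.append(f"Hint ({stage}): {puzzle.explanations[stage]}")
    lines.append("Answer:")
    return "\n".join(lines)


def progressive_prompts(puzzle):
    """Return one prompt per guidance level, from no hints to all three stages."""
    return [build_prompt(puzzle, k) for k in range(len(STAGES) + 1)]


# Invented example puzzle, for illustration only (not from the dataset).
example = Puzzle(
    question="Circles are arranged small, medium, large, small, medium, ?. "
             "What is the size of the missing circle?",
    explanations={
        "visual perception": "There are six positions; five show circles of "
                             "sizes small, medium, large, small, medium.",
        "inductive reasoning": "The sizes repeat in a cycle of small, medium, large.",
        "deductive reasoning": "The sixth position follows medium, so it is large.",
    },
)

for level, prompt in enumerate(progressive_prompts(example)):
    print(f"--- guidance level {level} ---")
    print(prompt)
```

Comparing a model's accuracy across the four guidance levels would indicate which stage accounts for most of the errors. If accuracy rises sharply once the perception hint is supplied, perception is the likely bottleneck.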