Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each puzzle is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance as complexity increases. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between human and MLLM performance. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.