Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.