Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution and tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicitly generated imagery, we find that they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: while they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that, despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.