Humans construct internal world models and reason by manipulating concepts within them. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate this human cognitive ability, with world models believed to be embedded within large language models. By relying predominantly on verbal reasoning, current systems have achieved expert-level performance in formal, abstract domains such as mathematics and programming. However, they still lag far behind humans in domains such as physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs), capable of both verbal and visual generation, has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though the benefits of such reasoning remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks, particularly those grounded in the physical world, visual generation serves more naturally as a world model, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze the distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning and construct a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, these findings clarify the potential of multimodal world modeling for more powerful, human-like multimodal AI.