Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing their logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, that is, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that "Parsing is Reasoning", we introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, which is then rendered into deterministic visual proofs for self-verification. To validate this approach, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system in which visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
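The draft, render, and verify loop described in the abstract can be illustrated with a toy sketch. This is not the paper's DSL or implementation: the `total` and `part` keywords, the bar-model layout, and the function names `parse`, `solve`, `render`, and `verify` are all hypothetical, chosen only to show how a drafted program can be executed into a deterministic picture that doubles as a check.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Draft:
    """A drafted bar model: one whole and its named parts (hypothetical DSL)."""
    total: int
    parts: Dict[str, Optional[int]]  # None marks the unknown part, written "?"


def parse(src: str) -> Draft:
    """Parse lines of the form 'total N' and 'part NAME N|?'."""
    total = None
    parts: Dict[str, Optional[int]] = {}
    for raw in src.strip().splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0] == "total" and len(tokens) == 2:
            total = int(tokens[1])
        elif tokens[0] == "part" and len(tokens) == 3:
            parts[tokens[1]] = None if tokens[2] == "?" else int(tokens[2])
        else:
            raise ValueError(f"cannot parse line: {raw!r}")
    if total is None:
        raise ValueError("draft has no 'total' line")
    return Draft(total, parts)


def solve(draft: Draft) -> Draft:
    """Fill in at most one unknown part so that the parts sum to the total."""
    unknown = [name for name, v in draft.parts.items() if v is None]
    if len(unknown) > 1:
        raise ValueError("more than one unknown part")
    known = sum(v for v in draft.parts.values() if v is not None)
    parts = dict(draft.parts)
    if unknown:
        parts[unknown[0]] = draft.total - known
    return Draft(draft.total, parts)


def verify(draft: Draft) -> bool:
    """Check that every part is a non-negative number and the parts fill the whole."""
    values = list(draft.parts.values())
    if any(v is None or v < 0 for v in values):
        return False
    return sum(values) == draft.total


def render(draft: Draft) -> str:
    """Render a two-row text bar: parts on top, the whole underneath.

    Each part is drawn with the first letter of its name, one cell per unit,
    so the two rows have equal width exactly when the parts sum to the total.
    """
    top = "".join(name[0] * v for name, v in draft.parts.items())
    bottom = "=" * draft.total
    return top + "\n" + bottom


if __name__ == "__main__":
    solved = solve(parse("total 12\npart apples 5\npart pears ?"))
    print(render(solved))
    print("verified:", verify(solved))
```

In this sketch the rendering is a pure function of the parsed draft, so the same draft always yields the same picture, and a mismatch between the two rows exposes an inconsistent draft without relying on the model's own judgment.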