While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat the reasoning history as an immutable stream: correcting a local error requires either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. The limitation is particularly acute in high-dimensional domains such as geometry and SVG design, where the textual form of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce \textbf{Canvas-of-Thought (Canvas-CoT)}. By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the ``ground truth''. Furthermore, we integrate a rendering-based critique loop that serves as a hard-constraint validator, providing explicit visual feedback for complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.