Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, which discards their symbolic semantics and limits reuse. While Multimodal Large Language Models (MLLMs) offer a way to bridge vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively grounded Chain-of-Thought reasoning. DwT produces interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, which is enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, and human evaluations confirm strong alignment in both accuracy and visual aesthetics. DwT thus offers a scalable solution for converting static visuals into executable representations and advances machine understanding of scientific graphics.
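To make the target representation concrete, the following is a minimal sketch of what "editable mxGraph XML" and "structurally valid" can mean in practice. It is not the DwT implementation: the plan format (`nodes`, `edges`), the function names `plan_to_mxgraph` and `find_structural_errors`, and the chosen style strings are assumptions made for illustration, and the hand-written plan stands in for the output of the planning stage. The validity check covers only well-formedness and dangling cell references, which is far narrower than the format-guided refinement described in the abstract.

```python
import xml.etree.ElementTree as ET


def plan_to_mxgraph(nodes, edges):
    """Serialize a hand-written plan (hypothetical format) into mxGraph XML."""
    model = ET.Element("mxGraphModel")
    root = ET.SubElement(model, "root")
    # mxGraph convention: cell "0" is the root cell, cell "1" is the default layer.
    ET.SubElement(root, "mxCell", id="0")
    ET.SubElement(root, "mxCell", id="1", parent="0")

    for node in nodes:
        cell = ET.SubElement(
            root, "mxCell",
            id=node["id"], value=node["label"],
            style="rounded=1;whiteSpace=wrap;html=1;",
            vertex="1", parent="1",
        )
        x, y, width, height = node["box"]
        # "as" is a Python keyword, so the attributes are passed as a dict.
        ET.SubElement(cell, "mxGeometry", {
            "x": str(x), "y": str(y),
            "width": str(width), "height": str(height),
            "as": "geometry",
        })

    for index, (source, target) in enumerate(edges):
        cell = ET.SubElement(
            root, "mxCell",
            id=f"e{index}", style="endArrow=classic;html=1;",
            edge="1", parent="1", source=source, target=target,
        )
        ET.SubElement(cell, "mxGeometry", {"relative": "1", "as": "geometry"})

    return ET.tostring(model, encoding="unicode")


def find_structural_errors(xml_text):
    """Return a list of problems: malformed XML, duplicate ids, dangling references."""
    try:
        model = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if model.tag != "mxGraphModel":
        return [f"unexpected root element: {model.tag}"]

    cells = model.findall("./root/mxCell")
    ids = [cell.get("id") for cell in cells]
    errors = []
    if len(ids) != len(set(ids)):
        errors.append("duplicate cell ids")

    known = set(ids)
    for cell in cells:
        for attr in ("parent", "source", "target"):
            ref = cell.get(attr)
            if ref is not None and ref not in known:
                errors.append(
                    f"cell {cell.get('id')}: {attr} refers to missing cell {ref}"
                )
    return errors


# A two-box diagram with one arrow, standing in for a planned layout.
nodes = [
    {"id": "n1", "label": "Encoder", "box": (40, 40, 120, 60)},
    {"id": "n2", "label": "Decoder", "box": (240, 40, 120, 60)},
]
edges = [("n1", "n2")]

xml_text = plan_to_mxgraph(nodes, edges)
errors = find_structural_errors(xml_text)
print(xml_text)
print(errors)
```

Because the output is plain XML with explicit cell ids, geometry, and source/target references, each box and arrow remains individually editable, which a raster image cannot offer. A check of this kind could in principle feed its error list back to the model as a refinement signal, though whether DwT does so in this form is not stated in the abstract.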