Graphics program synthesis is pivotal for interpreting and editing visual data: it reverse-engineers static visuals into editable TikZ code. Although TikZ is the de facto standard for scientific schematics thanks to its programmatic flexibility, its demand for rigorous spatial precision poses a significant challenge for Multimodal Large Language Models. Progress is currently held back by two gaps: (1) a data quality gap, as existing image-TikZ corpora often lack strict executability and reliable visual alignment; and (2) an evaluation gap, as benchmarks that measure both structural and visual fidelity are lacking. To address these gaps, we present a closed-loop framework with two components: SciTikZ-230K, a large-scale, high-quality dataset produced by our Execution-Centric Data Engine and covering 11 scientific disciplines; and SciTikZ-Bench, a multifaceted benchmark that spans basic geometric constructs to intricate hierarchical schematics and evaluates both visual fidelity and structural logic. We further introduce Dual Self-Consistency Reinforcement Learning, a novel optimization paradigm that uses Round-Trip Verification to penalize degenerate code and improve overall self-consistency. Built on these contributions, our trained model SciTikZer-8B achieves state-of-the-art performance, consistently outperforming proprietary models such as Gemini-2.5-Pro and much larger models such as Qwen3-VL-235B-A22B-Instruct.