Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, in which models generate symbolic code sequences without perceiving intermediate visual outcomes. This approach severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step and uses this on-the-fly feedback to guide subsequent generation. However, we demonstrate that naively applying this visual loop to off-the-shelf models is suboptimal, because such models cannot leverage incremental visual-code mappings. To address this, we first use fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy that conditions the generation of the next primitive on intermediate visual states. Furthermore, we propose a Render-and-Verify (RaV) inference mechanism to filter out degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
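The step-wise loop and the render-and-verify filter described above can be illustrated with a toy sketch. This is not the paper's implementation: the `Rect` primitive, the grid rasterizer, the `propose` callback standing in for the MLLM, and the `min_changed` pixel threshold are all hypothetical choices made so the example runs with the standard library alone. It only shows the control flow: render the cumulative canvas, let the generator see it, and keep a new primitive only if it visibly changes the canvas.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Canvas = List[List[str]]  # toy raster: one colour name per pixel


@dataclass(frozen=True)
class Rect:
    """Hypothetical stand-in for an SVG primitive (real systems emit paths)."""
    x: int
    y: int
    w: int
    h: int
    fill: str

    def to_svg(self) -> str:
        return (f'<rect x="{self.x}" y="{self.y}" width="{self.w}" '
                f'height="{self.h}" fill="{self.fill}"/>')


def render(canvas: Canvas, prim: Rect) -> Canvas:
    """Paint prim over a copy of the canvas; later primitives occlude earlier ones."""
    out = [row[:] for row in canvas]
    for yy in range(max(prim.y, 0), min(prim.y + prim.h, len(out))):
        for xx in range(max(prim.x, 0), min(prim.x + prim.w, len(out[0]))):
            out[yy][xx] = prim.fill
    return out


def changed_pixels(a: Canvas, b: Canvas) -> int:
    """Count pixels that differ between two canvases of equal size."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


# Assumed generator interface: sees the current canvas and the code so far,
# returns the next primitive or None when it considers the drawing finished.
Proposer = Callable[[Canvas, List[Rect]], Optional[Rect]]


def render_in_the_loop(propose: Proposer, size: int = 8, max_steps: int = 16,
                       min_changed: int = 1) -> Tuple[List[Rect], Canvas]:
    canvas: Canvas = [["white"] * size for _ in range(size)]
    accepted: List[Rect] = []
    for _ in range(max_steps):
        prim = propose(canvas, accepted)  # generation conditioned on the canvas
        if prim is None:
            break
        candidate = render(canvas, prim)
        # Verify step: a primitive that changes too few pixels is treated as
        # degenerate (e.g. zero area) or redundant (repaints what is there).
        if changed_pixels(canvas, candidate) < min_changed:
            continue
        accepted.append(prim)
        canvas = candidate
    return accepted, canvas


def scripted_model(script: List[Rect]) -> Proposer:
    """Toy proposer that replays a fixed list instead of querying a model."""
    it = iter(script)
    return lambda canvas, accepted: next(it, None)


def to_svg_document(prims: List[Rect], size: int) -> str:
    body = "".join(p.to_svg() for p in prims)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'viewBox="0 0 {size} {size}">{body}</svg>')


if __name__ == "__main__":
    script = [
        Rect(0, 0, 4, 4, "red"),
        Rect(1, 1, 2, 2, "red"),   # redundant: already red underneath
        Rect(5, 5, 0, 3, "blue"),  # degenerate: zero width
        Rect(2, 2, 4, 4, "blue"),
    ]
    kept, _ = render_in_the_loop(scripted_model(script))
    print(to_svg_document(kept, 8))
```

In this sketch the filter works purely on pixel differences, which is the simplest check that catches both failure modes named above; a real system would render actual SVG and could use a different acceptance criterion.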