LiveFigure: Generating Editable Scientific Illustration with VLM Agents

Scientific illustrations are essential for depicting conceptual designs, methodologies, and experimental workflows in research, and they play a pivotal role in communicating complex academic insights. However, creating high-quality scientific illustrations remains labor-intensive for researchers. While recent generative image models have advanced prompt-based editing, synthesizing fully editable figures remains a fundamental challenge. Genuine editability involves structured transformations of graphical elements, scales, attributes, and text, rather than simple pixel-level changes. Existing models generate raster outputs that do not support manual correction or layout adjustment, which limits their utility in scientific publishing, where editable vector figures are typically required for submission. To address this challenge, we introduce LiveFigure, an agentic framework driven by VLM agents that imitates the multi-step drawing workflow of human researchers. It first plans figure blueprints by drawing inspiration from high-quality references in prior work, then generates executable scripts that produce figures via the PowerPoint interface based on skills and experience, and finally refines the outputs with targeted visual diagnostics, producing fully vectorized, editable figures that meet publication standards. Extensive experiments demonstrate that LiveFigure generates inherently editable figures, achieving 80% publication-readiness in only 17 manual edits, far surpassing the 24% rate of the strongest baseline, NanoBanana. Human preference studies further validate this advantage, with LiveFigure securing a 60% win rate against NanoBanana. Our code is available at https://github.com/tsinghua-fib-lab/LiveFigure.git.
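The distinction between structured and pixel-level edits can be illustrated with a minimal sketch. The code below is not LiveFigure's implementation: the `Box` type, the function names, and the overlap check are all hypothetical stand-ins, and the script-generation stage that targets PowerPoint is omitted entirely. It only shows a diagnose-and-repair loop over a structured figure representation, where each fix is an attribute change on a named element rather than a change to pixels.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """One editable element: a labeled rectangle (units are arbitrary)."""
    label: str
    x: float
    y: float
    w: float
    h: float


def plan_blueprint(stages):
    """Hypothetical planning stand-in: one box per stage, left to right.

    The 1.5 step is deliberately narrower than the 2.0 box width, so the
    initial layout contains overlaps for the refinement loop to repair.
    """
    return [Box(name, x=i * 1.5, y=1.0, w=2.0, h=1.0)
            for i, name in enumerate(stages)]


def overlaps(a, b):
    """Axis-aligned rectangle intersection test."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def diagnose(boxes):
    """Hypothetical diagnostic stand-in: index pairs of overlapping boxes."""
    return [(i, j)
            for i in range(len(boxes))
            for j in range(i + 1, len(boxes))
            if overlaps(boxes[i], boxes[j])]


def refine(boxes, gap=0.5, max_rounds=10):
    """Repair one reported overlap per round until none remain."""
    for _ in range(max_rounds):
        issues = diagnose(boxes)
        if not issues:
            break
        i, j = issues[0]
        # Structured edit: move box j to the right of box i by changing
        # a single attribute; every other element is left untouched.
        moved = replace(boxes[j], x=boxes[i].x + boxes[i].w + gap)
        boxes = boxes[:j] + [moved] + boxes[j + 1:]
    return boxes


if __name__ == "__main__":
    figure = refine(plan_blueprint(["Plan", "Generate", "Refine"]))
    for box in figure:
        print(box)
```

Because every element keeps its label and geometry as explicit attributes, a human editor could make the same kind of correction by hand, which is the property a raster output lacks.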
