VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

Scalable Vector Graphics (SVG) is an essential format for technical illustration and digital design, offering resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is prohibitively labor-intensive and requires specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models (VLMs) trained for high-fidelity conversion of complex figures to SVG. Although this task is inherently data-driven, existing datasets are typically small and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and then transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH.
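
To make the notion of "recurring primitives and hierarchical local structures" concrete, the following is a minimal illustrative sketch, not part of VFIG itself. It parses a hypothetical toy diagram (`TOY_SVG`, invented here for illustration) with Python's standard library and reports how often each SVG element type occurs and how deeply elements are nested. The helper names `primitive_counts` and `nesting_depth` are assumptions introduced for this example only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy diagram (not from VFIG-DATA): two labeled boxes joined by a line.
TOY_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="220" height="80">
  <g id="node-a">
    <rect x="10" y="20" width="80" height="40" fill="none" stroke="black"/>
    <text x="50" y="45" text-anchor="middle">A</text>
  </g>
  <g id="node-b">
    <rect x="130" y="20" width="80" height="40" fill="none" stroke="black"/>
    <text x="170" y="45" text-anchor="middle">B</text>
  </g>
  <line x1="90" y1="40" x2="130" y2="40" stroke="black"/>
</svg>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split("}", 1)[-1]


def primitive_counts(svg_text):
    """Count how often each element type appears, including the root <svg>."""
    root = ET.fromstring(svg_text)
    return Counter(local_name(el.tag) for el in root.iter())


def nesting_depth(svg_text):
    """Return the maximum nesting depth, with the root <svg> at depth 0."""
    def depth(el, level):
        return max([level] + [depth(child, level + 1) for child in el])
    return depth(ET.fromstring(svg_text), 0)


if __name__ == "__main__":
    print(dict(primitive_counts(TOY_SVG)))
    print(nesting_depth(TOY_SVG))
```

Even in this small case, the same `rect` plus `text` group recurs for each node, which is the kind of local regularity the abstract refers to; real paper figures contain far more elements and deeper grouping.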
