Generating high-quality Scalable Vector Graphics (SVGs) from text remains a significant challenge. Existing LLM-based models that generate SVG code as a flat token sequence suffer from poor structural understanding and error accumulation, while optimization-based methods are slow and yield uneditable outputs. To address these limitations, we introduce SVGFusion, a unified framework that adapts the VAE-diffusion architecture to bridge the dual code-visual nature of SVGs. Our model features two core components: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that learns a perceptually rich latent space by jointly encoding SVG code and its rendered image, and a Vector Space Diffusion Transformer (VS-DiT) that achieves globally coherent compositions through iterative refinement. Furthermore, this architecture is enhanced by a Rendering Sequence Modeling strategy, which ensures accurate object layering and occlusion. Evaluated on our novel SVGX-Dataset comprising 240k human-designed SVGs, SVGFusion establishes a new state of the art, generating high-quality, editable SVGs that are strictly semantically aligned with the input text.
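The abstract does not say how Rendering Sequence Modeling is implemented, so the following is only a minimal sketch of the SVG property that such a strategy presumably relies on: SVG elements are painted in document order, so an element that appears later in the sequence occludes the ones before it. The names `Rect`, `to_svg`, and `rasterize` are hypothetical helpers made up for illustration and are not part of SVGFusion.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """A toy axis-aligned rectangle primitive (illustrative only)."""
    x: int
    y: int
    w: int
    h: int
    fill: str


def to_svg(shapes, size=8):
    """Serialize shapes to SVG markup, preserving their sequence order."""
    body = "\n".join(
        f'  <rect x="{s.x}" y="{s.y}" width="{s.w}" height="{s.h}" fill="{s.fill}"/>'
        for s in shapes
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {size} {size}">\n'
        f"{body}\n</svg>"
    )


def rasterize(shapes, size=8):
    """Paint shapes onto a size x size grid in sequence order.

    Later shapes overwrite earlier ones, mimicking SVG's painter's model.
    """
    canvas = [["none"] * size for _ in range(size)]
    for s in shapes:
        for row in range(max(s.y, 0), min(s.y + s.h, size)):
            for col in range(max(s.x, 0), min(s.x + s.w, size)):
                canvas[row][col] = s.fill
    return canvas


background = Rect(0, 0, 8, 8, "skyblue")
sun = Rect(2, 2, 4, 4, "gold")

# Background first, sun second: the sun is visible on top.
correct_order = rasterize([background, sun])
# Same primitives, reversed order: the background hides the sun.
wrong_order = rasterize([sun, background])

svg_markup = to_svg([background, sun])
```

With the same two primitives, only the ordering decides whether the sun is visible at pixel (3, 3). This is why a generator that emits primitives as a sequence has to get the order right, not just the shapes.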