Generating Scalable Vector Graphics (SVG) assets from textual data remains a significant challenge, largely due to the scarcity of high-quality vector datasets and the limitations of scalable vector representations for modeling intricate graphic distributions. This work introduces SVGFusion, a Text-to-SVG model that scales to real-world SVG data without relying on a text-based discrete language model or on prolonged SDS optimization. The essence of SVGFusion is to learn a continuous latent space for vector graphics within a popular Text-to-Image framework. Specifically, SVGFusion consists of two modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). The VP-VAE takes both SVGs and their corresponding rasterizations as input and learns a continuous latent space, while the VS-DiT learns to generate a latent code within this space conditioned on a text prompt. Building on the VP-VAE, a novel rendering-sequence modeling strategy is proposed to embed knowledge of the construction logic of SVGs into the latent space. This enables the model to achieve human-like design capabilities in vector graphics while systematically preventing occlusion in complex graphic compositions. Moreover, SVGFusion's capability can be continuously improved by exploiting the scalability of the VS-DiT, i.e., by stacking additional VS-DiT blocks. A large-scale SVG dataset is collected to evaluate the effectiveness of the proposed method. Extensive experiments confirm the superiority of SVGFusion over existing SVG generation methods in both quality and generalizability, establishing a novel framework for SVG content creation. Code, model, and data will be released at: \href{https://ximinng.github.io/SVGFusionProject/}{https://ximinng.github.io/SVGFusionProject/}