To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model's ability to capture the true semantic representation of visual scenes. This paper posits that an alternative image representation, vector graphics, can overcome this limitation by enabling a more natural and semantically coherent segmentation of the image information. We therefore introduce StrokeNUWA, a pioneering work that explores a better visual representation on vector graphics, "stroke tokens", which are inherently rich in visual semantics, naturally compatible with LLMs, and highly compressed. Equipped with stroke tokens, StrokeNUWA significantly surpasses traditional LLM-based and optimization-based methods across various metrics in the vector graphic generation task. In addition, StrokeNUWA achieves up to a 94x inference speedup over prior methods, with an exceptional SVG code compression ratio of 6.9%.