Diffusion models have become a new generative paradigm for text generation. Given the discrete, categorical nature of text, we propose GlyphDiffusion, a novel diffusion approach that performs text generation via text-guided image generation. Our key idea is to render the target text as a glyph image containing visual language content. In this way, conditional text generation can be cast as a glyph image generation task, and it becomes natural to apply continuous diffusion models to discrete text. Specifically, we use a cascaded architecture (i.e., a base diffusion model and a super-resolution diffusion model) to generate high-fidelity glyph images conditioned on the input text. Furthermore, we design a text grounding module that transforms and refines the visual language content of the generated glyph images into the final text. In experiments on four conditional text generation tasks with two classes of metrics (i.e., quality and diversity), GlyphDiffusion achieves results comparable to or better than several baselines, including pretrained language models. Our model also improves significantly over a recent diffusion model.
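To make the key idea concrete, the following is a minimal toy sketch of the two ends of such a pipeline: rendering target text as a glyph image, and mapping a (possibly imperfect) glyph image back to text. It is an illustration under stated assumptions, not the method described here: the 3x5 bitmap font `FONT`, the functions `render_glyph_image` and `ground_text`, and the nearest-glyph decoding rule are all hypothetical, and the diffusion models that would generate the image in between are omitted entirely.

```python
# Toy 3x5 bitmap font (hypothetical; not the font or resolution used in the paper).
FONT = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["011", "100", "100", "100", "011"],
    "T": ["111", "010", "010", "010", "010"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_H, GLYPH_W = 5, 3


def render_glyph_image(text):
    """Render text as a grid of 0/1 pixels: one glyph per character,
    separated by a single blank column."""
    rows = [[] for _ in range(GLYPH_H)]
    for i, ch in enumerate(text):
        glyph = FONT[ch]
        for r in range(GLYPH_H):
            if i > 0:
                rows[r].append(0)  # blank spacer column between glyphs
            rows[r].extend(int(p) for p in glyph[r])
    return rows


def ground_text(image):
    """Toy stand-in for a text grounding step (assumption): decode each
    glyph-sized patch as the font character with the smallest Hamming
    distance, so a few wrong pixels still decode to the intended character."""
    width = len(image[0]) if image else 0
    chars = []
    for x in range(0, width, GLYPH_W + 1):
        patch = [row[x:x + GLYPH_W] for row in image]

        def distance(ch):
            return sum(
                int(FONT[ch][r][c]) != patch[r][c]
                for r in range(GLYPH_H)
                for c in range(GLYPH_W)
            )

        chars.append(min(FONT, key=distance))
    return "".join(chars)


if __name__ == "__main__":
    image = render_glyph_image("CAT")
    image[0][0] ^= 1  # simulate one wrongly generated pixel
    print(ground_text(image))  # prints "CAT"
```

In this sketch the discrete text only appears at the two ends; everything in between operates on a pixel grid, which is what allows a continuous generative model to be used for the middle step.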