Diffusion models have become a new generative paradigm for text generation. Given the discrete, categorical nature of text, in this paper we propose \textsc{RenderDiffusion}, a novel diffusion approach that performs text generation via text-guided image generation. Our key idea is to render the target text as a \emph{glyph image} containing visual language content. In this way, conditional text generation can be cast as a glyph image generation task, which makes it natural to apply continuous diffusion models to discrete text. Specifically, we use a cascaded architecture (\ie a base diffusion model and a super-resolution diffusion model) to generate high-fidelity glyph images conditioned on the input text. Furthermore, we design a text grounding module that transforms and refines the visual language content of the generated glyph images into the final text. In experiments on four conditional text generation tasks with two classes of metrics (\ie quality and diversity), \textsc{RenderDiffusion} achieves results comparable to or even better than those of several baselines, including pretrained language models. Our model also improves significantly over a recent diffusion model.