Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these methods produce visually appealing results, they frequently make spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect, or extraneous characters, and they severely limit the performance of diffusion-based text image generation. To address this issue, this paper proposes UDiffText, a novel approach for text image generation that builds on a pre-trained diffusion model (i.e., Stable Diffusion [27]). We first design and train a lightweight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. We then fine-tune the diffusion model on a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing a refinement process at the inference stage, we achieve notably high sequence accuracy when synthesizing text in arbitrary given images. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. Furthermore, we showcase several potential applications of UDiffText, including text-centric image synthesis and scene text editing. Code and model will be available at https://github.com/ZYM-PKU/UDiffText.
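To make the sequence accuracy criterion mentioned in the abstract concrete, the following minimal sketch computes it under the assumption that it denotes the fraction of rendered words whose recognized string exactly matches the target string. The function name `sequence_accuracy`, the `case_sensitive` flag, and the sample strings are hypothetical illustrations, not part of the released UDiffText code, and the paper's exact evaluation protocol (for example, which recognizer reads the generated text) may differ.

```python
from typing import Sequence


def sequence_accuracy(
    predictions: Sequence[str],
    targets: Sequence[str],
    case_sensitive: bool = False,
) -> float:
    """Fraction of predictions that match their target string exactly.

    Assumption: a sample counts as correct only if every character
    matches, so one missing, wrong, or extra character makes it a miss.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0

    def normalize(text: str) -> str:
        # Optionally ignore letter case when comparing strings.
        return text if case_sensitive else text.lower()

    hits = sum(
        normalize(pred) == normalize(tgt)
        for pred, tgt in zip(predictions, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical recognizer outputs for three generated text images.
    recognized = ["hello", "wrold", "Text"]
    expected = ["hello", "world", "text"]
    print(sequence_accuracy(recognized, expected))
```

Under this definition a single swapped character, as in the second sample, counts the whole word as wrong, which is why spelling errors of the kind described above weigh heavily on the metric.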