Recently, with the rapid advancement of generative models, the field of visual text generation has witnessed significant progress. However, rendering high-quality text images in real-world scenarios remains challenging, as three critical criteria must be satisfied: (1) Fidelity: the generated text images should be photo-realistic, and their contents should match those specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, whether rendering-based or diffusion-based, can hardly meet all of these criteria simultaneously, limiting their range of application. Therefore, in this paper we propose a visual text generator (termed SceneVTG) that can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which a conditional diffusion model then uses as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Moreover, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
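The two-stage paradigm described above can be sketched as a simple pipeline: a first stage proposes text regions and contents for a scene, and a second stage renders an image conditioned on those proposals. This is a minimal illustrative sketch only; the function names, the `TextRegion` structure, and the stub return values are hypothetical stand-ins for the MLLM and the conditional diffusion model, not the actual SceneVTG implementation.

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    bbox: tuple   # (x, y, w, h) of the proposed text region (hypothetical format)
    content: str  # text content recommended for that region

def recommend_regions(scene_description: str) -> list:
    """Stage 1 stand-in for the MLLM: propose reasonable text regions
    and contents that cohere with the scene. Fixed stub output here."""
    return [TextRegion(bbox=(40, 120, 200, 48), content="OPEN 24 HOURS")]

def render_text_image(scene_description: str, regions: list) -> dict:
    """Stage 2 stand-in for the conditional diffusion model: generate a
    text image conditioned on the proposed regions and contents."""
    return {
        "scene": scene_description,
        "rendered": [(r.bbox, r.content) for r in regions],
    }

# Two-stage pipeline: recommendation, then conditional generation.
regions = recommend_regions("storefront at night")
image = render_text_image("storefront at night", regions)
print(image["rendered"][0][1])  # OPEN 24 HOURS
```

The key design point illustrated here is the decoupling: region/content reasoning (where text should go and what it should say) is separated from pixel-level synthesis, so each stage can be conditioned and evaluated independently.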