Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still unexplored. This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer) that connects denoising latents with multilingual text encoded as character tokens. We propose character positioning encoding and position encoding interpolation techniques to achieve controllable and precise text rendering. Additionally, we construct a large-scale synthetic text image dataset with 1 million multilingual image-text annotations and a high-quality dataset of 20K annotated images, used for pretraining and fine-tuning, respectively. Extensive experiments and evaluations demonstrate the effectiveness of our approach in multilingual text rendering, visual quality, and layout-aware text integration.