Diffusion-based text-to-image generation has made impressive progress recently. Although current image synthesis models can generate high-fidelity images, the text regions of a generated image often still give it away as synthetic. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in images. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former takes inputs such as text glyphs, positions, and a masked image and generates latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which are blended with image caption embeddings from the tokenizer to generate text that integrates seamlessly with the background. We train with a text-control diffusion loss and a text perceptual loss to further improve writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. AnyText can also be plugged into existing diffusion models from the community to render or edit text accurately. In extensive evaluation experiments, our method outperforms all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M, we propose AnyText-benchmark for evaluating the accuracy and quality of visual text generation. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
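As a rough illustration of the two ideas above (blending OCR-derived embeddings into the caption embeddings, and training with two losses), the following is a minimal sketch and not the released AnyText code. The abstract does not specify how the blending is done or how the two losses are combined, so this sketch assumes that each text line is marked in the caption by a placeholder token whose embedding is swapped for the corresponding glyph embedding, and that the two losses are combined as a weighted sum. The names `blend_glyph_embeddings`, `total_loss`, `placeholder`, and `weight` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def blend_glyph_embeddings(
    caption_tokens: Sequence[str],
    caption_embeddings: Sequence[Vector],
    glyph_embeddings: Sequence[Vector],
    placeholder: str = "*",
) -> List[Vector]:
    """Swap each placeholder token's embedding for a glyph embedding.

    Assumption (not stated in the abstract): placeholders appear in the
    caption in the same order as the text lines they stand for.
    """
    if len(caption_tokens) != len(caption_embeddings):
        raise ValueError("need exactly one embedding per caption token")
    if list(caption_tokens).count(placeholder) != len(glyph_embeddings):
        raise ValueError("need exactly one glyph embedding per placeholder")

    remaining = iter(glyph_embeddings)
    blended: List[Vector] = []
    for token, embedding in zip(caption_tokens, caption_embeddings):
        if token == placeholder:
            glyph = next(remaining)
            if len(glyph) != len(embedding):
                raise ValueError("glyph and caption embeddings differ in size")
            blended.append(list(glyph))
        else:
            blended.append(list(embedding))
    return blended


def total_loss(diffusion_loss: float, perceptual_loss: float, weight: float) -> float:
    """Weighted sum of the two training losses (the weighting is an assumption)."""
    return diffusion_loss + weight * perceptual_loss


# Toy usage with 2-dimensional embeddings.
tokens = ["a", "sign", "that", "says", "*"]
caption = [[0.0, 0.0] for _ in tokens]
glyphs = [[1.0, 2.0]]
blended = blend_glyph_embeddings(tokens, caption, glyphs)
loss = total_loss(diffusion_loss=1.0, perceptual_loss=2.0, weight=0.5)
```

In this toy run only the last position of `blended` changes, which mirrors the intent that the rendered text is conditioned on glyph information while the rest of the caption is encoded as usual.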