AnyText: Multilingual Visual Text Generation And Editing

Diffusion-based text-to-image generation has made impressive progress recently. Although current image synthesis technology is highly advanced and can generate high-fidelity images, the text regions of a generated image often still give it away as synthetic. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs such as text glyphs, positions, and the masked image to generate latent features for text generation or editing. The latter employs an OCR model to encode stroke data as embeddings, which are blended with the image caption embeddings from the tokenizer to generate text that integrates seamlessly with the background. We train with a text-control diffusion loss and a text perceptual loss to further enhance writing accuracy. AnyText can write characters in multiple languages; to the best of our knowledge, this is the first work to address multilingual visual text generation. Notably, AnyText can be plugged into existing diffusion models from the community to render or edit text accurately. In extensive evaluation experiments, our method outperforms all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text image dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on the AnyWord-3M dataset, we propose AnyText-benchmark for evaluating the accuracy and quality of visual text generation. Our project will be open-sourced at https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.
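The abstract does not say how the OCR-derived embeddings are blended with the caption embeddings, or how the two training losses are combined. The sketch below is one plausible reading, not the released implementation. It assumes each text line in the caption is marked by a placeholder token whose embedding is swapped for the corresponding glyph embedding, and that the two losses are added with a scalar weight. The names `PLACEHOLDER`, `blend_embeddings`, `combined_loss`, `toy_embed`, and the `weight` parameter are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]

# Hypothetical marker: one placeholder per text line to be rendered.
PLACEHOLDER = "*"


def blend_embeddings(
    caption_tokens: Sequence[str],
    embed_token: Callable[[str], Vector],
    glyph_embeddings: Sequence[Vector],
) -> List[Vector]:
    """Embed a caption, using one glyph embedding per placeholder token.

    Assumption: "blending" means substitution at placeholder positions.
    """
    n_slots = sum(1 for tok in caption_tokens if tok == PLACEHOLDER)
    if n_slots != len(glyph_embeddings):
        raise ValueError("need exactly one glyph embedding per placeholder")
    glyphs = iter(glyph_embeddings)
    return [
        list(next(glyphs)) if tok == PLACEHOLDER else embed_token(tok)
        for tok in caption_tokens
    ]


def combined_loss(
    text_control_loss: float, text_perceptual_loss: float, weight: float
) -> float:
    """Weighted sum of the two losses; the weighting scheme is an assumption."""
    return text_control_loss + weight * text_perceptual_loss


def toy_embed(token: str) -> Vector:
    """Stand-in for a real caption token embedding (illustration only)."""
    return [float(len(token)), 0.0]


if __name__ == "__main__":
    tokens = ["a", "sign", "that", "says", PLACEHOLDER]
    # In a real system this vector would come from an OCR encoder applied
    # to a rendered glyph image; here it is a made-up value.
    print(blend_embeddings(tokens, toy_embed, [[9.0, 9.0]]))
    print(combined_loss(1.0, 2.0, weight=0.5))
```

In a real pipeline the vectors would be tensors, and the glyph embeddings would probably need a projection into the caption embedding space before substitution. The sketch only shows the data flow the abstract describes.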

