Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word's visual makeup as a series of glyphs. To quantify this effect, we conduct a series of experiments comparing character-aware and character-blind text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell). Applying these findings to the visual domain, we train a suite of image generation models and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark). Our models set a much higher state of the art on visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.
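To make the distinction between character-blind and character-aware inputs concrete, the following minimal sketch contrasts the two views of the same word. It is an illustration only, not the encoders or tokenizers used in our experiments: the vocabulary `TOY_VOCAB`, its ID values, and the helper names `subword_ids` and `byte_ids` are hypothetical, and we assume a greedy longest-match subword segmentation for simplicity.

```python
# Hypothetical toy vocabulary; the pieces and ID values are made up for illustration.
TOY_VOCAB = {"draw": 17, "text": 42, "ing": 7}


def subword_ids(word, vocab):
    """Character-blind view: greedy longest-match segmentation into opaque IDs."""
    ids, i = [], 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {word[i:]!r}")
    return ids


def byte_ids(word):
    """Character-aware view: one ID per UTF-8 byte, so the spelling is explicit."""
    return list(word.encode("utf-8"))


print(subword_ids("drawtext", TOY_VOCAB))  # [17, 42]
print(byte_ids("drawtext"))  # [100, 114, 97, 119, 116, 101, 120, 116]
```

Under the subword view, the model receives two IDs that carry no direct signal about which letters the word contains, so the spelling must be memorized per vocabulary item. Under the byte-level view, the input sequence mirrors the glyph sequence that has to be rendered.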