Visual text rendering remains a fundamental challenge for contemporary text-to-image generation models, and the core problem lies in deficiencies of the text encoder. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution is to craft a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder on a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, yielding the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Notably, Glyph-SDXL gains the ability to render text paragraphs, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, by fine-tuning Glyph-SDXL on a small set of high-quality, photorealistic images featuring visual text, we show a substantial improvement in scene text rendering in open-domain real images. We hope these results encourage further exploration of customized text encoders for diverse and challenging tasks.
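To make "character awareness" concrete, the following is a minimal sketch of byte-level tokenization in the style of ByT5, not the Glyph-ByT5 implementation itself. It assumes the scheme used by the public ByT5 tokenizer, in which each UTF-8 byte is shifted by an offset of 3 (reserving ids for padding, end-of-sequence, and unknown tokens) and an end-of-sequence id of 1 is appended. The helper names `byte_level_ids` and `differing_positions` are hypothetical and introduced only for illustration.

```python
from typing import List


def byte_level_ids(text: str, offset: int = 3, eos_id: int = 1) -> List[int]:
    """Map text to byte-level ids: every UTF-8 byte shifted by `offset`, then EOS.

    Assumption: offset=3 and eos_id=1 mirror the public ByT5 tokenizer.
    """
    return [b + offset for b in text.encode("utf-8")] + [eos_id]


def differing_positions(a: str, b: str) -> List[int]:
    """Return the positions at which two equal-length id sequences differ."""
    ids_a, ids_b = byte_level_ids(a), byte_level_ids(b)
    if len(ids_a) != len(ids_b):
        raise ValueError("inputs must encode to the same number of bytes")
    return [i for i, (x, y) in enumerate(zip(ids_a, ids_b)) if x != y]


# Each character of an ASCII word gets its own id, so spelling is explicit.
print(byte_level_ids("Glyph"))                # [74, 111, 124, 115, 107, 1]

# A one-letter misspelling changes exactly one id, at the misspelled position.
print(differing_positions("Glyph", "Glyqh"))  # [3]
```

Because every character maps to its own id (or to a short run of ids for non-ASCII text), a single-letter spelling error is visible to the encoder as a single changed position, which is the property the abstract refers to as character awareness.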