Glyph-ByT5 recently achieved highly accurate visual text rendering in graphic design images. However, it supports only English, and its outputs have relatively weak visual appeal. In this work, we address these two limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which support accurate visual text rendering in 10 languages and achieve much better aesthetic quality. To this end, we make the following contributions: (i) we create a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine languages other than English; (ii) we build a multilingual visual paragraph benchmark of 1,000 prompts, 100 per language, to assess multilingual visual spelling accuracy; and (iii) we apply the latest step-aware preference learning approach to improve visual aesthetic quality. Combining these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that together support accurate spelling in 10 languages. We regard this work as a significant advance, given that the latest DALL-E3 and Ideogram 1.0 still struggle with multilingual visual text rendering.