Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as a precise RGB color value or the importance of each word. Furthermore, detailed text prompts for complex scenes are tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor that supports formats such as font style, size, color, and footnotes. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region from the attention maps of a diffusion process run on the plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and we preserve its fidelity to the plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate through quantitative evaluations that our method outperforms strong baselines.
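As a rough illustration of the region-based idea, the sketch below assigns each pixel to the word with the highest attention weight and then composes a region-specific result with a plain-text result using the resulting mask. This is a toy sketch under stated assumptions, not the method's actual implementation: the function names `word_regions` and `blend`, the argmax assignment rule, and the 2x2 attention values are all hypothetical and chosen only to make the idea concrete.

```python
from typing import Dict, List

Grid = List[List[float]]       # H x W grid of values (attention weights or pixels)
Mask = List[List[bool]]        # H x W boolean region mask


def word_regions(attn: Dict[str, Grid]) -> Dict[str, Mask]:
    """Assign each pixel to the word whose attention is highest there.

    Assumption: one attention map per word, all with the same H x W shape.
    """
    words = list(attn)
    height = len(attn[words[0]])
    width = len(attn[words[0]][0])
    masks = {w: [[False] * width for _ in range(height)] for w in words}
    for y in range(height):
        for x in range(width):
            best = max(words, key=lambda w: attn[w][y][x])
            masks[best][y][x] = True
    return masks


def blend(plain: Grid, rich: Grid, mask: Mask) -> Grid:
    """Keep the region-specific value inside the mask and the plain-text value outside."""
    return [
        [r if m else p for p, r, m in zip(prow, rrow, mrow)]
        for prow, rrow, mrow in zip(plain, rich, mask)
    ]


# Toy 2x2 example with made-up attention weights for two words.
attn = {
    "church": [[0.9, 0.8], [0.2, 0.1]],
    "garden": [[0.1, 0.2], [0.8, 0.9]],
}
masks = word_regions(attn)

plain = [[0.0, 0.0], [0.0, 0.0]]   # stand-in for the plain-text generation
rich = [[1.0, 1.0], [1.0, 1.0]]    # stand-in for a region-specific generation
composed = blend(plain, rich, masks["garden"])
```

In this toy setting, only the pixels assigned to "garden" take the region-specific value, while the rest keep the plain-text value, which loosely mirrors the goal of changing one region while leaving the remainder of the image faithful to the plain-text result.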