Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or the importance of each word. Furthermore, detailed text prompts for complex scenes are tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines in quantitative evaluations.
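To illustrate the first step of the region-based process described above, the sketch below derives a binary region mask per word by thresholding cross-attention maps from a plain-text diffusion pass. This is a minimal, hypothetical approximation: the function name, the `(num_tokens, H, W)` input layout (assumed to be averaged over attention heads and denoising steps), and the fixed threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def token_region_masks(attn_maps, threshold=0.5):
    """Approximate per-word region masks from cross-attention maps.

    attn_maps: float array of shape (num_tokens, H, W), assumed to be
    cross-attention averaged over heads and denoising steps
    (hypothetical preprocessing). Each pixel is assigned to the token
    with the strongest normalized attention, then weak responses are
    suppressed with a threshold.
    """
    # Normalize each token's map to [0, 1] so tokens are comparable.
    mins = attn_maps.min(axis=(1, 2), keepdims=True)
    maxs = attn_maps.max(axis=(1, 2), keepdims=True)
    norm = (attn_maps - mins) / (maxs - mins + 1e-8)

    # Assign each pixel to its strongest token; keep it only if the
    # normalized attention there exceeds the threshold.
    winner = norm.argmax(axis=0)  # (H, W) index of the dominant token
    masks = np.stack([(winner == i) & (norm[i] > threshold)
                      for i in range(norm.shape[0])])
    return masks  # (num_tokens, H, W) boolean, mutually disjoint
```

Because pixels are assigned by argmax, the resulting masks partition the image into non-overlapping regions, which is what the region-specific prompts and guidance would then operate on.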