FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting

Text-to-image (T2I) diffusion models have demonstrated impressive capabilities in generating high-quality images from a text prompt. However, ensuring prompt-image alignment, i.e., generating images that faithfully align with the prompt's semantics, remains a considerable challenge. Recent works attempt to improve faithfulness by optimizing the latent code, which can push the latent code out of distribution and thus produce unrealistic images. In this paper, we propose FRAP, a simple yet effective approach that adaptively adjusts the per-token prompt weights to improve both the prompt-image alignment and the authenticity of the generated images. We design an online algorithm that adaptively updates each token's weight coefficient by minimizing a unified objective function that encourages object presence and the binding of object-modifier pairs. Through extensive evaluations, we show that FRAP generates images with significantly higher prompt-image alignment on prompts from complex datasets, while having a lower average latency than recent latent code optimization methods, e.g., 4 seconds faster than D&B on the COCO-Subject dataset. Furthermore, through visual comparisons and evaluation on the CLIP-IQA-Real metric, we show that FRAP not only improves prompt-image alignment but also generates more authentic images with realistic appearances. We also explore combining FRAP with a prompt-rewriting LLM to recover its degraded prompt-image alignment, and we observe improvements in both prompt-image alignment and image quality.
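
To make the idea of adaptive per-token weighting concrete, the following is a minimal toy sketch and not the FRAP implementation. The abstract does not give the objective's exact form, so everything here is an assumption made for illustration: the stand-in attention model (`attention_maps`), the presence and binding terms in `unified_loss`, the finite-difference gradient, the learning rate, and the clipping range are all hypothetical. The sketch only shows the general pattern of updating each token's weight by descending a loss that rewards object presence and object-modifier overlap.

```python
import math
import random

def attention_maps(weights, scores):
    """Toy stand-in for cross-attention (assumption): per-pixel softmax over
    tokens, with each token's logits scaled by its prompt weight."""
    n_tokens, n_pixels = len(scores), len(scores[0])
    maps = [[0.0] * n_pixels for _ in range(n_tokens)]
    for p in range(n_pixels):
        logits = [weights[t] * scores[t][p] for t in range(n_tokens)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        for t in range(n_tokens):
            maps[t][p] = exps[t] / total
    return maps

def unified_loss(weights, scores, objects, pairs, lam=1.0):
    """Hypothetical objective: a presence term plus a binding term."""
    maps = attention_maps(weights, scores)
    # Presence: every object token should attend strongly to some pixel.
    presence = sum(1.0 - max(maps[i]) for i in objects) / len(objects)
    # Binding: an object and its modifier should attend to the same region.
    binding = 0.0
    for obj, mod in pairs:
        obj_total, mod_total = sum(maps[obj]), sum(maps[mod])
        overlap = sum(min(a / obj_total, b / mod_total)
                      for a, b in zip(maps[obj], maps[mod]))
        binding += 1.0 - overlap
    return presence + lam * binding

def update_weights(weights, scores, objects, pairs,
                   lr=0.5, eps=1e-4, w_min=0.5, w_max=2.0):
    """One gradient step on the weights, using central finite differences."""
    grad = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        grad.append((unified_loss(up, scores, objects, pairs)
                     - unified_loss(down, scores, objects, pairs)) / (2 * eps))
    # Clip so that no token weight drifts outside the assumed range.
    return [min(w_max, max(w_min, w - lr * g)) for w, g in zip(weights, grad)]

# Toy prompt "red car blue bench": tokens 0..3, objects are tokens 1 and 3.
random.seed(0)
scores = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]
weights = [1.0] * 4
for _ in range(10):  # stands in for successive denoising steps
    weights = update_weights(weights, scores, objects=[1, 3],
                             pairs=[(1, 0), (3, 2)])
```

In this sketch only the prompt weights change between steps while the latent is left untouched, which mirrors the contrast the abstract draws with latent code optimization. A real system would compute the loss from the diffusion model's cross-attention maps and would likely use automatic differentiation rather than finite differences.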
