Text-to-image synthesis with diffusion models has recently shown remarkable performance in generating high-quality images. Although these models perform well on simple texts, they may get confused when faced with complex texts that contain multiple objects or spatial relationships. One feasible way to obtain the desired images is to manually adjust the textual descriptions, i.e., rephrasing the text or adding words, which is labor-intensive. In this paper, we propose a framework that learns proper textual descriptions for diffusion models through prompt learning. By utilizing quality guidance and semantic guidance derived from the pre-trained diffusion model, our method can effectively learn prompts that improve the match between the input text and the generated images. Extensive experiments and analyses validate the effectiveness of the proposed method.