Despite their impressive capabilities, diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt: generated images may not contain all the mentioned objects, attributes, or relations. To alleviate these issues, recent works have proposed post-hoc methods that improve model faithfulness without costly retraining by modifying how the model utilizes the input prompt. In this work, we take a step back and show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts without any manipulation of the generative process. Building on this observation, we show that faithfulness can instead be treated as a candidate selection problem, and introduce a straightforward pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system that can leverage existing T2I evaluation metrics. Quantitative comparisons and user studies on diverse benchmarks show consistently improved faithfulness over post-hoc enhancement methods, at comparable or lower computational cost. Code is available at \url{https://github.com/ExplainableML/ImageSelect}.
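The candidate selection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation (the linked repository is the reference for that); it only assumes that some generator maps a prompt and a seed to a candidate image and that some scorer maps a prompt and a candidate to a number where higher means more faithful. The names `select_most_faithful`, `toy_generate`, and `toy_score` are hypothetical, and the toy stand-ins operate on word lists rather than real images so that the example runs without any model.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def select_most_faithful(
    prompt: str,
    generate: Callable[[str, int], Candidate],
    score: Callable[[str, Candidate], float],
    seeds: Sequence[int],
) -> Tuple[Candidate, float, int]:
    """Generate one candidate per seed and return the highest-scoring one.

    `generate` and `score` are placeholders (assumptions): in practice they
    would wrap a T2I diffusion model and a faithfulness metric, respectively.
    Ties are resolved in favour of the earliest seed in `seeds`.
    """
    if not seeds:
        raise ValueError("at least one seed is required")

    best_candidate = generate(prompt, seeds[0])
    best_score = score(prompt, best_candidate)
    best_seed = seeds[0]

    for seed in seeds[1:]:
        candidate = generate(prompt, seed)
        candidate_score = score(prompt, candidate)
        if candidate_score > best_score:
            best_candidate, best_score, best_seed = candidate, candidate_score, seed

    return best_candidate, best_score, best_seed


# --- Toy stand-ins (hypothetical, for illustration only) -------------------

def toy_generate(prompt: str, seed: int) -> List[str]:
    """Pretend 'image': the prompt's words with the first `seed` words dropped."""
    return prompt.split()[seed:]


def toy_score(prompt: str, candidate: List[str]) -> float:
    """Fraction of distinct prompt words that appear in the candidate."""
    wanted = set(prompt.split())
    return len(wanted & set(candidate)) / len(wanted)


if __name__ == "__main__":
    image, faithfulness, seed = select_most_faithful(
        "red cube beside blue sphere", toy_generate, toy_score, seeds=[2, 0, 1]
    )
    print(seed, faithfulness, image)
```

The cost of this sketch grows linearly with the number of seeds, since each seed requires one generation and one scoring call, so the candidate count is the main knob for trading compute against the chance of finding a faithful image.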