Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models can generate high-quality, realistic images, their computationally intensive sequential denoising has raised societal concerns about computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most existing efforts take a fixed approach, either simplifying the neural network or optimizing the text prompt. Are the quality improvements from all denoising computations equally perceivable to humans? We observe that images generated from different text prompts may require different amounts of computation, depending on the desired content. This observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics as a function of the number of diffusion steps. Using the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
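To make the abstract's selection idea concrete, the sketch below shows one plausible way a per-prompt step suggestion could work. Everything here is hypothetical and not the authors' implementation: `predict_quality` stands in for BudgetFusion's learned predictor of multi-level perceptual metrics, and the candidate step counts, tolerance, and quality curve are illustrative placeholders.

```python
# A minimal sketch, assuming a hypothetical predictor `predict_quality(prompt, steps)`
# that estimates a perceptual-similarity score in [0, 1] between the image produced
# with `steps` denoising steps and a fully converged reference image.

def suggest_steps(prompt, candidate_steps=(10, 20, 30, 40, 50), tolerance=0.02):
    """Return the smallest step count whose predicted perceptual quality is
    within `tolerance` of the best candidate's predicted quality."""
    scores = {s: predict_quality(prompt, s) for s in candidate_steps}
    best = max(scores.values())
    for s in sorted(candidate_steps):
        if scores[s] >= best - tolerance:
            return s
    return max(candidate_steps)


def predict_quality(prompt, steps):
    # Placeholder: stands in for a learned model mapping a text prompt to
    # predicted perceptual metrics per step count. A fake saturating curve
    # keeps the sketch runnable end to end.
    return 1.0 - 0.5 ** (steps / 5)


if __name__ == "__main__":
    # With the fake curve above, quality saturates around 30 steps, so the
    # suggestion stops there rather than running all 50.
    print(suggest_steps("a watercolor painting of a lighthouse at dusk"))
```

The key design choice this illustrates is that the prediction happens before any denoising runs, so the saved steps are never computed at all.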