Diffusion models are the current state of the art in image generation, synthesizing high-quality images by breaking the generation process down into many fine-grained denoising steps. Despite their strong performance, diffusion models are computationally expensive, requiring many neural function evaluations (NFEs). In this work, we propose an anytime diffusion-based method that can generate viable images when stopped at an arbitrary time before completion. Using existing pretrained diffusion models, we show that the generation scheme can be recomposed as two nested diffusion processes, enabling fast iterative refinement of a generated image. In experiments on ImageNet and on Stable Diffusion-based text-to-image generation, we show, both qualitatively and quantitatively, that the quality of our method's intermediate generations greatly exceeds that of the original diffusion model, while the final results remain comparable. We illustrate the applicability of Nested Diffusion in several settings, including solving inverse problems and rapid text-based content creation that allows user intervention throughout the sampling process.
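To make the idea of two nested diffusion processes more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a toy one-dimensional Gaussian dataset whose exact denoiser stands in for a pretrained network, and it uses deterministic DDIM updates for both levels. The outer process steps through noise levels. At each outer step, a short inner diffusion run produces a complete clean estimate, which the outer step then uses as its prediction of the clean image. The names `denoise`, `ddim_step`, `inner_sampler`, `nested_diffusion`, the constants `MU` and `SIGMA_D`, and the noise schedules are all hypothetical.

```python
import numpy as np

# Hypothetical toy setup (an assumption for illustration): data x0 ~ N(MU, SIGMA_D^2 I),
# whose exact denoiser E[x0 | x_t] is known in closed form and stands in for a
# pretrained diffusion network.
MU, SIGMA_D = 2.0, 0.5


def denoise(x_t, abar):
    """Posterior mean E[x0 | x_t] for x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps."""
    s = np.sqrt(abar)
    gain = s * SIGMA_D**2 / (abar * SIGMA_D**2 + (1.0 - abar))
    return MU + gain * (x_t - s * MU)


def ddim_step(x_t, x0_hat, abar_t, abar_next):
    """Deterministic DDIM (eta = 0) move from noise level abar_t (< 1) to abar_next."""
    eps_hat = (x_t - np.sqrt(abar_t) * x0_hat) / np.sqrt(1.0 - abar_t)
    return np.sqrt(abar_next) * x0_hat + np.sqrt(1.0 - abar_next) * eps_hat


def inner_sampler(x_t, abar_t, n_inner):
    """Inner diffusion: a short DDIM run from noise level abar_t down to clean data."""
    levels = np.linspace(abar_t, 1.0, n_inner + 1)
    x = x_t
    for a, a_next in zip(levels[:-1], levels[1:]):
        # When a_next == 1.0 the DDIM update returns the clean estimate itself.
        x = ddim_step(x, denoise(x, a), a, a_next)
    return x


def nested_diffusion(x_T, outer_levels, n_inner):
    """Outer diffusion whose 'denoiser' is a full inner diffusion run.

    Yields a complete image estimate after every outer step, so stopping the
    loop early still leaves a usable output (the anytime property).
    """
    x = x_T
    for i, a in enumerate(outer_levels):
        x0_hat = inner_sampler(x, a, n_inner)  # complete inner generation
        yield x0_hat                          # viable output if stopped here
        if i + 1 < len(outer_levels):
            x = ddim_step(x, x0_hat, a, outer_levels[i + 1])


# Usage: a 4-"pixel" image, 5 outer steps, 3 inner steps each (all hypothetical).
rng = np.random.default_rng(0)
x_T = rng.standard_normal(4)                # start from (approximately) pure noise
outer_levels = np.linspace(1e-3, 0.99, 5)   # outer schedule given as abar values
estimates = []
for k, x0_hat in enumerate(nested_diffusion(x_T, outer_levels, n_inner=3)):
    estimates.append(x0_hat)                # stopping after step k would return x0_hat
    print(k, np.round(x0_hat, 3))
```

In this sketch, breaking out of the loop at any outer step returns the latest inner estimate rather than a noisy intermediate state. The number of inner steps (`n_inner`) trades the cost of each outer step against the quality of each intermediate estimate.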