Recent studies have shown that the forward diffusion process is crucial to both the generative quality and the sampling efficiency of diffusion models. We propose incorporating an analytical image attenuation process into the forward diffusion process, enabling high-quality (un)conditioned image generation with significantly fewer denoising steps than the vanilla diffusion model, which requires thousands of steps. In a nutshell, our method represents the forward image-to-noise mapping as a simultaneous \textit{image-to-zero} mapping and \textit{zero-to-noise} mapping. Under this framework, we mathematically derive 1) the training objectives and 2) the reverse-time sampling formula, based on an analytical attenuation function that models the image-to-zero mapping. The former enables our method to learn the noise and image components simultaneously, which simplifies learning. Importantly, because the latter's \textit{zero-to-image} sampling function is analytic, we can avoid ordinary differential equation-based accelerators and instead naturally perform sampling with an arbitrary step size. We have conducted extensive experiments on unconditioned image generation, \textit{e.g.}, on CIFAR-10 and CelebA-HQ-256, and on image-conditioned downstream tasks such as super-resolution, saliency detection, edge detection, and image inpainting. The proposed diffusion models achieve competitive generative quality with far fewer denoising steps than the state of the art, thus greatly accelerating generation. In particular, to generate images of comparable quality, our models require only one-twentieth of the denoising steps of the baseline denoising diffusion probabilistic models. Moreover, we achieve state-of-the-art performance on the image-conditioned tasks using no more than 10 steps.
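To make the decoupled forward mapping and the arbitrary-step-size sampling concrete, the following is a minimal sketch under assumptions that the abstract does not state: time runs over $[0, 1]$, the attenuation is linear (the image component is scaled by $1 - t$), and the noise component is Gaussian with variance $t$. The function names `forward_sample` and `reverse_step` are hypothetical, and the true image and noise components stand in for the outputs of a trained network. Under these assumptions the reverse step is the closed-form Gaussian posterior, so any step size $0 < \Delta t \le t$ can be used.

```python
import numpy as np


def forward_sample(x0, t, eps):
    """Assumed forward process: the image decays linearly to zero
    (image-to-zero) while the noise grows from zero (zero-to-noise)."""
    return (1.0 - t) * x0 + np.sqrt(t) * eps


def reverse_step(xt, t, dt, x0_hat, eps_hat, rng):
    """One reverse step from time t to t - dt, valid for any 0 < dt <= t.

    x0_hat and eps_hat are the predicted image and noise components
    (in practice, network outputs; here they are supplied by the caller).
    """
    s = max(t - dt, 0.0)
    # Restore a dt-sized share of the image and remove the matching noise.
    mean = xt + dt * x0_hat - (dt / np.sqrt(t)) * eps_hat
    # Posterior standard deviation under the assumed Gaussian noise process.
    std = np.sqrt(dt * s / t)
    return mean + std * rng.standard_normal(xt.shape)


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))   # toy "image"
eps = rng.standard_normal((4, 4))          # noise component

x1 = forward_sample(x0, 1.0, eps)          # fully attenuated: pure noise
# A single step of size dt = t, using the true components as an oracle.
x0_rec = reverse_step(x1, 1.0, 1.0, x0, eps, rng)
```

With oracle components, one step of size $\Delta t = t$ recovers the image exactly. With a learned predictor the recovery is only approximate, which is why a handful of steps may still be needed in practice.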