Recent diffusion probabilistic models (DPMs) have shown remarkable generative abilities; however, they often suffer from complex forward processes, which make solving the reverse process inefficient and prolong sampling. In this paper, we address these challenges by focusing on the diffusion process itself: we propose to decouple the intricate diffusion process into two comparatively simpler processes to improve generative efficacy and speed. In particular, we present a novel diffusion paradigm named DDM (Decoupled Diffusion Models) based on the Itô diffusion process, in which the image distribution is approximated by an explicit transition probability while the noise path is controlled by the standard Wiener process. We find that decoupling the diffusion process reduces the learning difficulty and that the explicit transition probability significantly improves generation speed. We derive a new training objective for DPMs, which enables the model to learn to predict the noise and image components separately. Moreover, given the novel forward diffusion equation, we derive the reverse denoising formula of DDM, which naturally supports generation in fewer steps without ordinary differential equation (ODE) based accelerators. Our experiments demonstrate that DDM outperforms previous DPMs by a large margin in the few-function-evaluation setting and achieves comparable performance in the long-function-evaluation setting. We also show that our framework can be applied to image-conditioned generation and high-resolution image synthesis, and that it can generate high-quality images with only 10 function evaluations.
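The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the image component is attenuated linearly to zero over t in [0, 1] while the noise component follows a standard Wiener process with variance t. The abstract does not state the attenuation schedule, so the linear choice, the function name `forward_sample`, and the toy input are hypothetical and not taken from the paper.

```python
import math
import random


def forward_sample(x0, t, rng):
    """Sample x_t from an assumed decoupled forward process.

    Image component: x0 attenuated linearly to zero, (1 - t) * x0.
        (Assumption: the schedule is chosen here only for illustration.)
    Noise component: standard Wiener process, w_t ~ N(0, t).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # The two components are computed independently of each other.
    image_part = [(1.0 - t) * v for v in x0]
    noise_part = [math.sqrt(t) * rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [a + b for a, b in zip(image_part, noise_part)]
    return x_t, image_part, noise_part


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]  # toy "image" with three pixels
for t in (0.0, 0.5, 1.0):
    x_t, image_part, _ = forward_sample(x0, t, rng)
    print(t, image_part, [round(v, 3) for v in x_t])
```

Under this assumption the sample equals the clean input at t = 0 and is pure Gaussian noise at t = 1. Because the two components are separate terms, a model can be trained to predict each one on its own, which is the property the abstract attributes to the new training objective.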