Recent diffusion probabilistic models (DPMs) have shown remarkable generative ability, but they often rely on complex forward processes, which make the reverse process inefficient to solve and prolong sampling. In this paper, we address these challenges by focusing on the diffusion process itself: we propose to decouple the intricate diffusion process into two comparatively simpler processes to improve generative efficacy and speed. In particular, we present a novel diffusion paradigm named DDM (Decoupled Diffusion Models) based on the Ito diffusion process, in which the image distribution is approximated by an explicit transition probability while the noise path is controlled by the standard Wiener process. We find that decoupling the diffusion process reduces the learning difficulty and that the explicit transition probability improves the generative speed significantly. We derive a new training objective for DPMs, which enables the model to learn to predict the noise and image components separately. Moreover, given the novel forward diffusion equation, we derive the reverse denoising formula of DDM, which naturally supports generation in fewer steps without ordinary differential equation (ODE) based accelerators. Our experiments demonstrate that DDM outperforms previous DPMs by a large margin in the setting with few function evaluations and achieves comparable performance in the setting with many function evaluations. We also show that our framework can be applied to image-conditioned generation and high-resolution image synthesis, and that it can generate high-quality images with only 10 function evaluations.
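As a rough illustration of the decoupling idea, the sketch below samples from a forward process split into an explicit image path and a Wiener noise path. The abstract does not give the equations, so the specific form used here is an assumption: time t runs over [0, 1], the image component is attenuated linearly to zero at a constant rate h = -x0, and the noise component is a standard Wiener process with variance t. The name `forward_sample` is hypothetical and is not taken from the paper's code.

```python
import numpy as np


def forward_sample(x0, t, rng):
    """Draw x_t from an assumed decoupled forward process.

    Assumption (not stated in the abstract): x_t = x0 + t * h + sqrt(t) * eps,
    with constant attenuation rate h = -x0 and eps ~ N(0, I).
    """
    h = -x0                               # image path: drives the image to zero at t = 1
    eps = rng.standard_normal(x0.shape)   # noise path: W_t = sqrt(t) * eps, variance t
    x_t = x0 + t * h + np.sqrt(t) * eps
    # A model trained in this spirit would regress both components separately.
    return x_t, h, eps


rng = np.random.default_rng(0)
x0 = np.ones((4, 4))                      # toy stand-in for an image

x_start, _, _ = forward_sample(x0, 0.0, rng)    # t = 0: the clean image
x_mid, h_mid, eps_mid = forward_sample(x0, 0.5, rng)
x_end, _, eps_end = forward_sample(x0, 1.0, rng)  # t = 1: pure noise
```

Under these assumptions the two endpoints are exact: at t = 0 the sample is the clean image, and at t = 1 the image term vanishes and only the noise draw remains. Because the intermediate state is available in closed form, no step-by-step simulation of the forward process is needed in this sketch.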