Recent diffusion probabilistic models (DPMs) have shown remarkable generative ability; however, they often suffer from complex forward processes, which make solving the reverse process inefficient and prolong sampling. In this paper, we address these challenges by focusing on the diffusion process itself: we propose to decouple the intricate diffusion process into two comparatively simpler processes to improve generative efficacy and speed. In particular, we present a novel diffusion paradigm named DDM (Decoupled Diffusion Models) based on the Itô diffusion process, in which the image distribution is approximated by an explicit transition probability while the noise path is controlled by the standard Wiener process. We find that decoupling the diffusion process reduces the learning difficulty and that the explicit transition probability significantly improves generative speed. We derive a new training objective for DPMs, which enables the model to learn to predict the noise and image components separately. Moreover, given the novel forward diffusion equation, we derive the reverse denoising formula of DDM, which naturally supports generation in fewer steps without ordinary differential equation (ODE) based accelerators. Our experiments demonstrate that DDM outperforms previous DPMs by a large margin in the setting with few function evaluations and achieves comparable performance in the setting with many function evaluations. We also show that our framework can be applied to image-conditioned generation and high-resolution image synthesis, and that it can generate high-quality images with only 10 function evaluations.
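To make the decoupling idea concrete, the following is a minimal NumPy sketch of one simple way such a process could be instantiated. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the drift, the schedule, or the network parameterization. The sketch assumes a constant image drift c = -x_0 on t in [0, 1], so that x_t = x_0 + t c + sqrt(t) eps and x_1 is pure noise. The reverse step then follows from the standard Brownian-bridge conditional of a Wiener process. All function names (`forward_sample`, `decoupled_loss`, `reverse_step`, `sample`) and the `predict` callable are hypothetical.

```python
import numpy as np


def forward_sample(x0, t, rng):
    """Draw x_t = x0 + t*c + sqrt(t)*eps for t in (0, 1].

    Assumption (not from the abstract): the image drift is the constant
    c = -x0, so the image term (1 - t) * x0 vanishes at t = 1.
    """
    c = -x0
    eps = rng.standard_normal(x0.shape)  # W_t / sqrt(t) for a standard Wiener process
    x_t = x0 + t * c + np.sqrt(t) * eps
    return x_t, c, eps


def decoupled_loss(c_pred, eps_pred, c, eps):
    """Sum of two mean-squared errors: image component and noise component."""
    return float(np.mean((c_pred - c) ** 2) + np.mean((eps_pred - eps) ** 2))


def reverse_step(x_t, t, s, c_hat, eps_hat, rng):
    """Sample x_{t-s} given x_t and predicted (c_hat, eps_hat).

    Under the assumed forward process, conditioning the Wiener path on its
    value at time t (Brownian bridge) gives
        mean = x_t - s*c - (s / sqrt(t)) * eps,  variance = s*(t - s)/t.
    """
    mean = x_t - s * c_hat - (s / np.sqrt(t)) * eps_hat
    var = max(s * (t - s) / t, 0.0)  # guard against tiny negative round-off
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)


def sample(predict, shape, n_steps, rng):
    """Run n_steps reverse steps from t = 1 down to t = 0.

    `predict(x, t)` is a hypothetical stand-in for a trained network that
    returns estimates (c_hat, eps_hat); each call is one function evaluation.
    """
    x = rng.standard_normal(shape)  # x_1 is pure noise under the assumed drift
    s = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * s
        c_hat, eps_hat = predict(x, t)
        x = reverse_step(x, t, s, c_hat, eps_hat, rng)
    return x
```

In this sketch the step size `s` is a free parameter of the closed-form conditional, so a coarse grid such as `n_steps=10` needs no separate ODE solver. With an oracle `predict` that returns the true c and eps, the final step has zero variance and returns x_0 exactly; with a learned predictor, sample quality depends on the prediction error.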