Recent diffusion probabilistic models (DPMs) have shown remarkable generative ability, but they often rely on complex forward processes, which makes the reverse process inefficient to solve and prolongs sampling. In this paper, we address these challenges by focusing on the diffusion process itself: we propose to decouple the intricate diffusion process into two comparatively simpler processes to improve generative efficacy and speed. In particular, we present a novel diffusion paradigm named DDM (Decoupled Diffusion Models) based on the Ito diffusion process, in which the image distribution is approximated by an explicit transition probability while the noise path is controlled by the standard Wiener process. We find that decoupling the diffusion process reduces the learning difficulty and that the explicit transition probability significantly improves generation speed. We derive a new training objective for DPMs, which enables the model to learn to predict the noise and image components separately. Moreover, given the novel forward diffusion equation, we derive the reverse denoising formula of DDM, which naturally supports generation in fewer steps without ordinary differential equation (ODE) based accelerators. Our experiments demonstrate that DDM outperforms previous DPMs by a large margin in the setting with few function evaluations and achieves comparable performance in the setting with many function evaluations. We also show that our framework can be applied to image-conditioned generation and high-resolution image synthesis, and that it can generate high-quality images with only 10 function evaluations.
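To make the idea of a decoupled forward process concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes the image component is attenuated by an explicit linear schedule `(1 - t)` on `t` in `[0, 1]`; that schedule, the function name `decoupled_forward`, and the array shapes are all hypothetical choices made for illustration. The only part taken as given is that a standard Wiener process satisfies W_t ~ N(0, tI), so the noise component at time `t` can be sampled as `sqrt(t) * eps` with `eps` standard normal.

```python
import numpy as np


def decoupled_forward(x0, t, rng):
    """Sample x_t from an assumed decoupled forward process.

    The state is the sum of two independent parts:
      - an image part that is attenuated deterministically, and
      - a noise part driven by a standard Wiener process.
    """
    # Image component: explicit, deterministic attenuation of x0.
    # ASSUMPTION: the linear schedule (1 - t) is illustrative only.
    image_part = (1.0 - t) * x0

    # Noise component: a standard Wiener process has W_t ~ N(0, t * I),
    # so W_t can be sampled as sqrt(t) * eps with eps ~ N(0, I).
    eps = rng.standard_normal(x0.shape)
    noise_part = np.sqrt(t) * eps

    # Returning eps alongside x_t exposes the noise target separately
    # from the image term, mirroring the two-component view.
    return image_part + noise_part, eps


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))  # hypothetical batch of "images"

x_start, _ = decoupled_forward(x0, 0.0, rng)      # t = 0: the clean image
x_mid, eps_mid = decoupled_forward(x0, 0.5, rng)  # t = 0.5: partly attenuated, partly noised
x_end, eps_end = decoupled_forward(x0, 1.0, rng)  # t = 1: pure noise
```

Under these assumptions the two endpoints are easy to read off: at `t = 0` the sample is the clean image, and at `t = 1` the image term has vanished and only the Wiener noise remains. Because the image path and the noise path enter as separate additive terms, a model could in principle be trained against each term separately, which is the intuition behind predicting the noise and image components separately.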