Recent diffusion probabilistic models (DPMs) have shown remarkable generative ability; however, they often suffer from complex forward processes, which lead to inefficient solutions of the reverse process and prolonged sampling times. In this paper, we address these challenges by focusing on the diffusion process itself: we propose to decouple the intricate diffusion process into two comparatively simpler processes to improve generative efficacy and speed. In particular, we present a novel diffusion paradigm named DDM (Decoupled Diffusion Models), based on the Itô diffusion process, in which the image distribution is approximated by an explicit transition probability while the noise path is controlled by the standard Wiener process. We find that decoupling the diffusion process reduces the learning difficulty and that the explicit transition probability significantly improves generative speed. We derive a new training objective for DPMs that enables the model to learn to predict the noise and image components separately. Moreover, given the novel forward diffusion equation, we derive the reverse denoising formula of DDM, which naturally supports generation in fewer steps without ordinary differential equation (ODE) based accelerators. Our experiments demonstrate that DDM outperforms previous DPMs by a large margin in the setting with few function evaluations and achieves comparable performance in the setting with many function evaluations. We also show that our framework can be applied to image-conditioned generation and high-resolution image synthesis, and that it can generate high-quality images with only 10 function evaluations.
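The decoupling described in the abstract can be illustrated with a short sketch. The abstract does not state the forward equation, so the form used here is an assumption: the image component is attenuated linearly to zero at a constant rate `h = -x0` over `t` in [0, 1], and the noise component is a standard Wiener process with variance `t`. The names `forward_sample` and `decoupled_loss` are hypothetical, and the loss is only a plausible stand-in for the two-target objective, not the paper's derived one.

```python
import numpy as np


def forward_sample(x0, t, rng):
    """Draw x_t from an assumed decoupled forward process.

    Image component: x0 + t * h with h = -x0 (assumed linear attenuation).
    Noise component: standard Wiener process, W_t ~ N(0, t * I).
    """
    h = -x0                                # assumed constant attenuation rate
    eps = rng.standard_normal(x0.shape)    # unit Gaussian noise
    image_part = x0 + t * h                # equals (1 - t) * x0
    noise_part = np.sqrt(t) * eps          # Wiener increment from 0 to t
    return image_part + noise_part, h, eps


def decoupled_loss(pred_h, pred_eps, h, eps):
    """Sum of two mean squared errors, one per component (illustrative)."""
    image_term = np.mean((pred_h - h) ** 2)
    noise_term = np.mean((pred_eps - eps) ** 2)
    return image_term + noise_term


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))   # toy batch of "images"
xt, h, eps = forward_sample(x0, 0.5, rng)

# A model would map (xt, t) to estimates of h and eps; zeros stand in here.
loss = decoupled_loss(np.zeros_like(h), np.zeros_like(eps), h, eps)
```

Under these assumptions, `x_t` equals the clean image at `t = 0` and pure noise at `t = 1`, and the two regression targets are available in closed form for every `t`, which is one way to read the claim that decoupling reduces the learning difficulty.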