Diffusion models have recently revolutionized image generation. Despite their impressive generative capabilities, most of these models rely on the current sample to denoise the next one, which can make the denoising process unstable. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Rather than applying the moving average directly to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform the moving average there, which avoids distribution shift across timesteps. Because diffusion models recover low-frequency components first and high-frequency details later, we further decompose the samples into different frequency components and apply the moving average separately to each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF can be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that MASF achieves superior performance compared to the baselines, with almost negligible additional complexity cost.
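The three ingredients named in the abstract (mapping to data space, moving average, per-frequency averaging) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a DDPM-style noise-prediction model, so the data-space estimate follows the standard relation x0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t). The abstract does not say how the frequency decomposition or the averaging weights are chosen, so the FFT low-pass mask, the `cutoff` value, the two decay rates, and the names `predict_x0`, `split_bands`, and `FrequencyEMA` are all hypothetical choices made for illustration.

```python
import numpy as np


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Map a noisy sample to data space (standard DDPM relation, assumed here)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)


def split_bands(x, cutoff=0.25):
    """Split a 2D array into low- and high-frequency parts.

    Uses a radial FFT mask; this decomposition is an illustrative assumption,
    not necessarily the one used in the paper.
    """
    fy = np.fft.fftfreq(x.shape[0])[:, None]
    fx = np.fft.fftfreq(x.shape[1])[None, :]
    low_mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    low = np.fft.ifft2(np.fft.fft2(x) * low_mask).real
    high = x - low  # the residual, so low + high reconstructs x
    return low, high


class FrequencyEMA:
    """Exponential moving average kept separately for each frequency band."""

    def __init__(self, decay_low=0.5, decay_high=0.9, cutoff=0.25):
        # Hypothetical decay rates; the abstract does not specify any weights.
        self.decay_low = decay_low
        self.decay_high = decay_high
        self.cutoff = cutoff
        self.low = None
        self.high = None

    def update(self, x0_hat):
        """Fold a new data-space estimate into the running averages."""
        low, high = split_bands(x0_hat, self.cutoff)
        if self.low is None:
            # First estimate initializes both running averages.
            self.low, self.high = low, high
        else:
            self.low = self.decay_low * self.low + (1.0 - self.decay_low) * low
            self.high = self.decay_high * self.high + (1.0 - self.decay_high) * high
        return self.low + self.high
```

In a sampling loop, one would call `predict_x0` at each timestep, pass the result to `FrequencyEMA.update`, and use the returned averaged estimate in place of the raw one when forming the next sample. How the averaged estimate is fed back into a particular sampler is not covered by the abstract and is left out of this sketch.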