A diffusion probabilistic model (DPM) constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples; such models have been shown to handle complex data distributions. Despite their recent success in image synthesis, applying DPMs to video generation remains challenging because of the high-dimensional data space. Previous methods usually adopt a standard diffusion process, in which the frames of a video clip are corrupted with independent noise, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and supports text-conditioned video creation well.
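The noise decomposition described above can be illustrated with a minimal NumPy sketch. The abstract does not specify how the base and residual noise are combined, so the function name `decomposed_noise`, the mixing weight `lam`, and the square-root weighting are assumptions made for illustration only; the square-root weighting is chosen here simply because it keeps each frame's noise at unit variance.

```python
import numpy as np


def decomposed_noise(num_frames, frame_shape, lam=0.5, rng=None):
    """Sample per-frame noise as a mix of shared and per-frame components.

    `lam` is a hypothetical mixing weight in [0, 1]; it is an assumption of
    this sketch, not a value given in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Base noise: sampled once and shared by every frame in the clip.
    base = rng.standard_normal(frame_shape)

    # Residual noise: sampled independently for each frame (time axis first).
    residual = rng.standard_normal((num_frames, *frame_shape))

    # Square-root weights keep the variance at lam + (1 - lam) = 1 per element.
    # `base` broadcasts over the leading frame axis.
    noise = np.sqrt(lam) * base + np.sqrt(1.0 - lam) * residual
    return noise, base, residual


if __name__ == "__main__":
    # Example: a 16-frame clip of 3x32x32 frames.
    noise, base, residual = decomposed_noise(16, (3, 32, 32), lam=0.5)
    print(noise.shape)
```

Under these assumptions, setting `lam = 0` recovers the independent per-frame noise of a standard diffusion process, while `lam = 1` gives every frame the same noise. For values in between, the noise of any two distinct frames has correlation `lam`, which is one way to encode the temporal correlation the abstract refers to.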