A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions. Despite the recent success of DPMs in image synthesis, applying them to video generation remains challenging because of the high-dimensional data space. Previous methods usually adopt a standard diffusion process, in which frames in the same video clip are corrupted with independent noise, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and supports text-conditioned video creation well.
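The noise decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the base and residual noise are combined, so the square-root mixing rule below, the `share` parameter, and the function names are assumptions made for illustration only, not the authors' implementation. Each frame is represented as a flat list of values to keep the example free of dependencies.

```python
import math
import random


def decomposed_noise(num_frames, frame_size, share=0.5, seed=0):
    """Sample per-frame noise as a mix of shared base noise and per-frame residual noise.

    `share` is a hypothetical parameter: the fraction of each frame's noise
    variance that comes from the shared base noise. It must lie in [0, 1].
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be in [0, 1]")
    rng = random.Random(seed)
    # Base noise: drawn once and reused by every frame in the clip.
    base = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
    frames = []
    for _ in range(num_frames):
        # Residual noise: drawn independently for each frame.
        residual = [rng.gauss(0.0, 1.0) for _ in range(frame_size)]
        # Assumed mixing rule: square-root weights keep each entry at unit variance.
        frames.append([
            math.sqrt(share) * b + math.sqrt(1.0 - share) * r
            for b, r in zip(base, residual)
        ])
    return frames


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    noise = decomposed_noise(num_frames=4, frame_size=20000, share=0.8)
    print("correlation between frame 0 and frame 1:", correlation(noise[0], noise[1]))
```

Under this assumed rule, the correlation between the noise of any two frames approaches `share` as the frame size grows. Setting `share=0` recovers the standard process with independent per-frame noise, while `share=1` gives every frame identical noise.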