Diffusion-based video generation models have demonstrated remarkable success in producing high-fidelity videos through an iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational cost. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through adversarial training, a multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to synthesize high-quality videos in a single forward pass, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality with significantly reduced computational overhead for the denoising process (i.e., around $23\times$ speedup compared with SVD and $6\times$ speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are publicly available at https://snap-research.github.io/SF-V.