Consistency models have proven highly effective for efficient image generation: they synthesize images in only a few sampling steps, alleviating the high computational cost of diffusion models. However, consistency models remain underexplored in video generation, which is more challenging and resource-intensive. In this report, we present the VideoLCM framework to fill this gap. It carries the concept of consistency models over from image generation to synthesize videos efficiently, in a minimal number of steps, while maintaining high quality. VideoLCM builds upon existing latent video diffusion models and incorporates consistency distillation techniques to train the latent consistency model. Experimental results demonstrate the effectiveness of VideoLCM in terms of computational efficiency, fidelity, and temporal consistency. Notably, VideoLCM achieves high-fidelity and smooth video synthesis with only four sampling steps, showcasing the potential for real-time synthesis. We hope that VideoLCM can serve as a simple yet effective baseline for subsequent research. The source code and models will be publicly available.
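To make the idea of few-step sampling concrete, the following is a minimal sketch of generic multistep consistency sampling with four steps. It is not the VideoLCM sampler: the noise levels, the latent shape, and the function names (`toy_consistency_fn`, `multistep_consistency_sample`) are assumptions chosen for illustration, and the trained video network is replaced by a closed-form consistency function that is exact only for toy Gaussian data under a variance-exploding noise schedule.

```python
import numpy as np

EPS = 0.002  # assumed smallest noise level, where the consistency function is the identity


def toy_consistency_fn(x, t, mu=1.0, s=0.5):
    """Closed-form consistency function for toy data drawn from N(mu, s^2).

    Stand-in for a trained network (hypothetical): it maps a noisy sample at
    noise level t back to the start of its probability-flow ODE trajectory.
    """
    scale = np.sqrt((s**2 + EPS**2) / (s**2 + t**2))
    return mu + (x - mu) * scale


def multistep_consistency_sample(f, shape, timesteps, rng):
    """Alternate denoising (one call to f) and re-noising over decreasing noise levels."""
    # Step 1: draw pure noise at the largest level and denoise it in one shot.
    x = f(rng.standard_normal(shape) * timesteps[0], timesteps[0])
    # Remaining steps: re-noise to a smaller level, then denoise again.
    for tau in timesteps[1:]:
        z = rng.standard_normal(shape)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z
        x = f(x_tau, tau)
    return x


rng = np.random.default_rng(0)
four_steps = [80.0, 20.0, 5.0, 1.0]  # assumed schedule: four network evaluations in total
# Toy "video latent" with shape (frames, height, width); real latents would also have channels.
latents = multistep_consistency_sample(toy_consistency_fn, (4, 8, 8), four_steps, rng)
```

Each entry in `four_steps` costs exactly one evaluation of the consistency function, which is where the efficiency gain over a many-step diffusion sampler comes from. In a latent video model, the returned latents would then be passed through a decoder to obtain frames.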