Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional approaches based on generative adversarial networks (GANs). While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b). In comparison, our work provides a broader, more up-to-date, and more fine-grained perspective on diffusion-based approaches, with dedicated sections on evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of the related works covered in this survey is also available at https://github.com/Eyeline-Research/Survey-Video-Diffusion.
