Video generation aims to produce temporally coherent sequences of visual frames, representing a pivotal advancement in Artificial Intelligence Generated Content (AIGC). Compared to static image generation, video generation poses unique challenges: it demands not only high-quality individual frames but also strong temporal coherence to ensure consistency across the entire spatiotemporal sequence. Although research addressing spatiotemporal consistency in video generation has grown in recent years, systematic reviews focusing on this core issue remain relatively scarce. To fill this gap, this paper frames the video generation task as a sequential sampling process from a high-dimensional spatiotemporal distribution and analyzes spatiotemporal consistency from this perspective. We provide a systematic review of the latest advancements in the field, spanning multiple dimensions including generation models, feature representations, generation frameworks, post-processing techniques, training strategies, benchmarks, and evaluation metrics, with a particular focus on the mechanisms by which various methods maintain spatiotemporal consistency and their effectiveness in doing so. Finally, this paper explores future research directions and potential challenges in this field, aiming to provide valuable insights for advancing video generation technology. The project link is https://github.com/Yin-Z-Y/A-Survey-Spatiotemporal-Consistency-in-Video-Generation.