Text-to-video generation marks a significant frontier in the rapidly evolving field of generative AI, integrating advances in text-to-image synthesis, video captioning, and text-guided editing. This survey critically examines the progression of text-to-video technologies, focusing on the shift from traditional generative models to the cutting-edge Sora model and highlighting developments in scalability and generalizability. Unlike prior work, we offer an in-depth exploration of the technological frameworks and evolutionary pathways of these models. We also examine practical applications and address ethical and technological challenges: current models struggle to handle multiple entities, learn cause-and-effect relationships, model physical interactions, preserve object scale and proportion, and avoid object hallucination, a long-standing problem in generative models. Our discussion covers the use of text-to-video generation models as human-assistive tools and world models, identifies the models' shortcomings, and summarizes directions for future improvement, which center mainly on training datasets and evaluation metrics (both automatic and human-centered). Aimed at both newcomers and seasoned researchers, this survey seeks to catalyze further innovation and discussion in the growing field of text-to-video generation, paving the way for more reliable and practical generative artificial intelligence technologies.