Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

Text-to-video generation marks a significant frontier in the rapidly evolving domain of generative AI, integrating advancements in text-to-image synthesis, video captioning, and text-guided editing. This survey critically examines the progression of text-to-video technologies, focusing on the shift from traditional generative models to the cutting-edge Sora model, and highlights developments in scalability and generalizability. Unlike prior works, we offer an in-depth exploration of the technological frameworks and evolutionary pathways of these models. We also examine practical applications and address ethical and technological challenges, such as the inability to handle multiple entities, learn cause and effect, understand physical interaction, perceive object scale and proportion, and avoid object hallucination, a long-standing problem in generative models. Our discussion covers the use of text-to-video generation models as human-assistive tools and world models, identifies the models' shortcomings, and summarizes directions for future improvement, which center mainly on training datasets and evaluation metrics (both automatic and human-centered). Aimed at both newcomers and seasoned researchers, this survey seeks to catalyze further innovation and discussion in the growing field of text-to-video generation, paving the way for more reliable and practical generative artificial intelligence technologies.

