Text-to-video (T2V) generation holds the potential to transform domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading-comprehension challenges by creating coherent visual content from natural-language prompts. Since its inception, the field has advanced from adversarial models to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist in text-video alignment, long-range temporal coherence, and computational efficiency. Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures, and detailing how these models work, which limitations of their predecessors they addressed, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated, and, to support reproducibility and gauge the accessibility of training such models, we detail their training configurations, including hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the metrics commonly used to evaluate these models and report their performance across standard benchmarks, while discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing on our analysis, we identify the current open challenges and propose several promising directions for future research, laying out a perspective for researchers to explore and build upon in advancing T2V research and applications.