The evolution of text-to-video generation, from animating MNIST to simulating the world with Sora, has progressed at breakneck speed. Here, we systematically discuss how far text-to-video generation technology supports the essential requirements of world modeling. We curate 250+ studies on text-based video synthesis and world modeling, and observe that recent models increasingly support spatial, action, and strategic intelligence in world modeling through adherence to completeness, consistency, and invention, as well as human interaction and control. We conclude that text-to-video generation is adept at world modeling, although open problems in several aspects, such as the diversity-consistency trade-off, remain to be addressed.