Simulating the Visual World with Artificial Intelligence: A Roadmap

Source: arXiv. Project page: https://world-model-roadmap.github.io/ GitHub repo: https://github.com/ziqihuangg/Awesome-From-Video-Generation-to-World-Model

The landscape of video generation is shifting from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models: models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a "window" into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains, such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained in the GitHub repository linked above.
