World models and video generation are pivotal technologies for autonomous driving, and each contributes to the robustness and reliability of autonomous systems. World models simulate the dynamics of real-world environments, while video generation models produce realistic video sequences; the two are increasingly being integrated to improve situational awareness and decision-making in autonomous vehicles. This paper investigates the relationship between these two technologies, focusing on how their structural parallels, particularly in diffusion-based models, contribute to more accurate and coherent simulations of driving scenarios. We examine leading works such as JEPA, Genie, and Sora, which exemplify different approaches to world model design and thereby highlight the lack of a universally accepted definition of world models. These diverse interpretations reflect the field's evolving understanding of how world models can be optimized for various autonomous driving tasks. We also discuss the key evaluation metrics employed in this domain, such as Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance (FID) for assessing the quality of generated video content. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions, emphasizing the potential of these technologies to jointly advance the performance of autonomous driving systems. The survey aims to provide a comprehensive understanding of how integrating video generation and world models can drive the development of safer and more reliable autonomous vehicles.