Exploring the Interplay Between Video Generation and World Models in Autonomous Driving: A Survey

World models and video generation are pivotal technologies in autonomous driving, each contributing to the robustness and reliability of autonomous systems. World models simulate the dynamics of real-world environments, while video generation models produce realistic video sequences; the two are increasingly integrated to improve situational awareness and decision-making in autonomous vehicles. This paper investigates the relationship between these technologies, focusing on how their structural parallels, particularly in diffusion-based models, contribute to more accurate and coherent simulation of driving scenarios. We examine leading works such as JEPA, Genie, and Sora, which exemplify different approaches to world model design and thereby highlight the lack of a universally accepted definition of world models. These diverse interpretations underscore the field's evolving understanding of how world models can be optimized for various autonomous driving tasks. We also discuss key evaluation metrics used in this domain, such as Chamfer distance for 3D scene reconstruction and Fréchet Inception Distance (FID) for assessing the quality of generated video content. By analyzing the interplay between video generation and world models, this survey identifies critical challenges and future research directions, emphasizing the potential of these technologies to jointly advance autonomous driving performance. It aims to provide a comprehensive understanding of how integrating video generation and world models can drive the development of safer and more reliable autonomous vehicles.
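To make the Chamfer distance mentioned above concrete, the following is a minimal sketch under stated assumptions, not the survey's own implementation. It assumes one common convention (the mean squared nearest-neighbor distance in each direction, summed over both directions); other works use unsquared distances or different normalizations. The function names `chamfer_distance`, `_sq_dist`, and `_mean_nearest` are hypothetical.

```python
from typing import Sequence

Point = Sequence[float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average, over src, of the squared distance to the nearest point in dst."""
    return sum(min(_sq_dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: mean squared nearest-neighbor distance from a to b
    plus the same quantity from b to a. This brute-force version is O(N*M)
    and is meant only for illustration on small point sets.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return _mean_nearest(a, b) + _mean_nearest(b, a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

Identical point sets yield a distance of zero, and the value grows as reconstructed points drift from the reference geometry. For realistic point cloud sizes, a spatial index or a batched GPU implementation would normally replace the nested loops.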

