DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Video generation models, as one form of world model, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset, curated from both driving datasets and internet-scale video sources and spanning varied weather, times of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general-purpose models look better but break physics, while driving-specific models capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models that support scalable simulation, planning, and data-driven decision-making.
