How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
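The policy selection use case described above (simulate the future under each candidate action, score it with a reward estimator, keep the best) can be sketched as follows. This is a minimal illustration, not ReSim's implementation: `select_trajectory`, `toy_simulate`, and `toy_reward` are hypothetical names, and the two toy functions are hand-written stand-ins for the world model and the Video2Reward module, whose actual interfaces are not specified in the abstract.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a trajectory is a list of future ego waypoints (x, y),
# and a "video" is whatever the simulator returns (here, a list of floats).
Trajectory = Sequence[Tuple[float, float]]
Video = List[float]


def select_trajectory(
    history: Video,
    candidates: Dict[str, Trajectory],
    simulate: Callable[[Video, Trajectory], Video],
    video_to_reward: Callable[[Video], float],
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate by the reward of its simulated future and return the best."""
    scores: Dict[str, float] = {}
    for name, trajectory in candidates.items():
        future = simulate(history, trajectory)   # world-model rollout (assumed interface)
        scores[name] = video_to_reward(future)   # reward estimate (assumed interface)
    best = max(scores, key=lambda name: scores[name])
    return best, scores


def toy_simulate(history: Video, trajectory: Trajectory) -> Video:
    """Stand-in for the world model: one 'frame' per waypoint, holding the lateral offset."""
    return [abs(y) for _, y in trajectory]


def toy_reward(future: Video) -> float:
    """Stand-in for a reward estimator: penalise the largest lateral deviation."""
    return 1.0 / (1.0 + max(future))


candidates = {
    "keep_lane": [(5.0, 0.0), (10.0, 0.0)],
    "swerve": [(5.0, 1.5), (10.0, 3.0)],
}
best, scores = select_trajectory([], candidates, toy_simulate, toy_reward)
print(best, scores)
```

With these toy stand-ins the lane-keeping candidate scores higher than the swerve. In a real setting, the quality of the selection depends on how faithfully the simulator follows non-expert actions, which is the gap the abstract sets out to close.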