CausalDrive: Real-time Causal World Models for Autonomous Driving

Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, which makes them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions. This enables text-driven control over Driving Sociology: users can dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture, which combines continuous flow-matching with a self-correcting distillation objective and achieves interactive speeds of 12 FPS. This transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale reinforcement learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.
