World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories for all background agents, which makes them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latency, which hinders closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive conditions only on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. Because future non-player character (NPC) layouts are excluded, the model must predict causal interactions itself. This enables text-driven control over Driving Sociology: users can dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. It combines continuous flow matching with a self-correcting distillation objective and reaches an interactive speed of 12 FPS, turning the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly fewer collision artifacts, (2) large-scale reinforcement learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments indicate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.