Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.
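To make the gated causal injection idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract specifies only a dual-stream design, a learned vector gate, and strict temporal causality; the causally masked cross-attention used here to read the traffic stream, along with every name (`gated_causal_injection`, `init_params`, `gate_scale`, and the weights `Wq`, `Wk`, `Wv`, `Wg`, `bg`), is an assumption chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(dim, seed=0):
    """Random weights standing in for learned parameters (hypothetical)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(dim)
    return {
        "Wq": rng.normal(0.0, scale, (dim, dim)),
        "Wk": rng.normal(0.0, scale, (dim, dim)),
        "Wv": rng.normal(0.0, scale, (dim, dim)),
        "Wg": rng.normal(0.0, scale, (2 * dim, dim)),
        "bg": np.zeros(dim),
    }


def gated_causal_injection(driver, traffic, params, gate_scale=1.0):
    """Inject out-cabin context into in-cabin latents, one direction only.

    driver, traffic: arrays of shape (T, D), one latent per time step.
    gate_scale: test-time knob (assumed); 0.0 removes external influence.
    Returns updated driver latents of shape (T, D).
    """
    T, D = driver.shape
    q = driver @ params["Wq"]   # queries come from the driver stream
    k = traffic @ params["Wk"]  # keys and values come from the traffic stream
    v = traffic @ params["Wv"]

    # Causal mask: driver step t may read traffic steps 0..t only.
    scores = (q @ k.T) / np.sqrt(D)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)

    # Row-wise softmax; the diagonal is always unmasked, so each row is valid.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    context = weights @ v

    # Vector gate: one sigmoid value per latent channel and time step.
    gate_in = np.concatenate([driver, context], axis=1)
    gate = sigmoid(gate_in @ params["Wg"] + params["bg"])

    # Residual update: the traffic stream perturbs the driver stream,
    # and nothing flows back from driver to traffic.
    return driver + gate_scale * gate * context
```

Under these assumptions, perturbing the traffic latents from step t onward leaves the driver outputs before step t unchanged, and setting `gate_scale=0.0` returns the driver latents untouched. Both behaviors mirror the kind of controlled test-time intervention the abstract describes.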