Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential for learning rich representations by predicting how a driving scene will evolve. However, existing driving world models focus primarily on visual scene representation; their motion representation is not explicitly designed to be shared with, and inherited by, a planner. This leaves a gap between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning by unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We then transfer the vision and motion encoders to a downstream Multi-modal Planner, so that the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction among the motion representation, the visual representation, and the ego status can then generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representations from the frozen world model to evaluate and select optimal trajectories in real time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation, providing strong evidence that unifying vision and motion representation is effective for robust autonomous driving.
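To make the propose-then-select idea concrete, the following is a minimal sketch, not the WorldDrive implementation. It assumes a fixed trajectory vocabulary stored as an array, a planner that emits one score per vocabulary entry, and a rewarder that maps a trajectory to a scalar. The names `select_trajectory`, `planner_logits`, `toy_reward`, and `top_k` are hypothetical, and the toy reward (penalizing lateral drift) merely stands in for a rewarder that would score candidates using future latent representations.

```python
import numpy as np


def select_trajectory(vocab, planner_logits, reward_fn, top_k=3):
    """Pick one trajectory from a fixed vocabulary (illustrative only).

    vocab: (K, T, 2) array of K candidate trajectories, T (x, y) waypoints each.
    planner_logits: (K,) array of planner scores, one per vocabulary entry.
    reward_fn: callable mapping a (T, 2) trajectory to a scalar reward.
    """
    # Multi-modal proposals: keep the top_k highest-scoring vocabulary entries.
    top_idx = np.argsort(planner_logits)[::-1][:top_k]
    # Score each proposal with the rewarder and keep the best one.
    rewards = np.array([reward_fn(vocab[i]) for i in top_idx])
    best = int(top_idx[int(np.argmax(rewards))])
    return best, vocab[best]


# Hypothetical vocabulary: 5 trajectories, 4 waypoints, varying curvature.
x = np.arange(1.0, 5.0)
curvatures = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
vocab = np.stack(
    [np.stack([x, c * x**2], axis=-1) for c in curvatures]
)  # shape (5, 4, 2)

# Hypothetical planner scores for the 5 vocabulary entries.
planner_logits = np.array([0.1, 2.0, 1.5, 1.8, 0.0])


def toy_reward(traj):
    # Assumption: a stand-in reward that penalizes lateral drift at the
    # final waypoint; a learned rewarder would replace this function.
    return -abs(traj[-1, 1])


best_idx, best_traj = select_trajectory(vocab, planner_logits, toy_reward)
print(best_idx)  # 2: the straight trajectory wins among the top-3 proposals
```

In this toy setup the planner's highest-scoring candidate (index 1) is not the one finally chosen, which illustrates why a separate rewarder over several proposals can change the outcome compared with taking the planner's top score alone.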
