End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) models with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture because their latent states are not adequately shared, which limits the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating the VLA model and the world model at the representation level, enabling the VLA planner to benefit directly from holistic scene-evolution modeling and reducing its reliance on densely annotated supervision. Additionally, DriveWorld-VLA uses the latent states of the world model as the core decision-making states of the VLA planner, allowing the planner to assess how candidate actions affect future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance: 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 0.16 average collision rate at 3 seconds on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.
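To illustrate the general idea of action-conditioned imagination in latent space, the following is a minimal sketch, not the paper's actual architecture: the dimensions, the linear-plus-tanh transition, and the norm-based cost used to rank candidate action sequences are all hypothetical stand-ins for the learned world model and planner.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 8, 2, 4

# Hypothetical latent transition: z_{t+1} = tanh(W @ [z_t; a_t]).
# In a real system W would be a learned world-model network.
W = rng.standard_normal((LATENT_DIM, LATENT_DIM + ACTION_DIM)) * 0.1

def rollout(z0, actions):
    """Action-conditioned imagination entirely in latent space:
    unroll the transition for each action, never decoding to pixels."""
    z, traj = z0, []
    for a in actions:
        z = np.tanh(W @ np.concatenate([z, a]))
        traj.append(z)
    return np.stack(traj)  # (HORIZON, LATENT_DIM)

z0 = rng.standard_normal(LATENT_DIM)          # current latent scene state
candidates = [rng.standard_normal((HORIZON, ACTION_DIM)) for _ in range(3)]

# Score each candidate action sequence by a latent-space cost
# (sum of state norms here, purely as a placeholder objective).
scores = [np.linalg.norm(rollout(z0, acts), axis=1).sum() for acts in candidates]
best = int(np.argmin(scores))
```

The planner-side benefit described in the abstract corresponds to the last two lines: candidate actions are compared by how they steer the imagined latent future, with no pixel-level rendering in the loop.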