In autonomous driving, dynamic environments and corner cases pose significant challenges to the robustness of the ego vehicle's state understanding and decision making. We introduce VDRive, a novel pipeline for end-to-end autonomous driving that explicitly models the state-action mapping to address these challenges, enabling interpretable and robust decision making. By pairing the state-understanding capability of a Vision-Language-Action (VLA) model with a generative diffusion-policy action head, VDRive guides driving both contextually and geometrically. Contextually, the VLA predicts future observations through token-generation pre-training, where observations are represented as discrete codes by a Conditional Vector Quantized Variational Autoencoder (CVQ-VAE). Geometrically, we fine-tune the VLA with reinforcement learning to predict future trajectories and actions from current driving conditions. The VLA supplies current and predicted state tokens to the action policy head, which generates hierarchical actions and trajectories. During policy training, a learned critic evaluates the actions generated by the policy and provides gradient-based feedback, forming an actor-critic framework that enables reinforcement-based policy learning. Experiments show that VDRive achieves state-of-the-art performance on the Bench2Drive closed-loop benchmark and in nuScenes open-loop planning.