Current Vision-Language-Action (VLA) paradigms in autonomous driving rely primarily on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online reinforcement learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework built on a single large language model (LLM) equipped with two distinct sets of LoRA parameters. One set serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive performs trial-and-error learning over a finite set of discrete linguistic driving decisions rather than operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. MindDrive achieves strong closed-loop performance on the challenging Bench2Drive benchmark, with a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09%. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
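The core idea of searching over a finite set of linguistic decisions instead of a continuous trajectory space can be illustrated with a toy example. The sketch below is not the paper's implementation: the decision vocabulary, the reward function, and the tabular softmax policy (a stand-in for the Decision Expert; the Action Expert's decision-to-trajectory mapping is abstracted away) are all illustrative assumptions. It shows how a trajectory-level scalar reward can drive REINFORCE-style trial-and-error learning over discrete decisions.

```python
# Illustrative sketch: online RL over discrete linguistic driving decisions.
# The decision set, reward, and policy parameterization are assumptions made
# for illustration; MindDrive's actual Decision/Action Experts are LLM-based.
import math
import random

DECISIONS = ["accelerate", "keep_speed", "decelerate", "turn_left", "turn_right"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

class LinguisticPolicy:
    """Toy softmax policy over discrete linguistic decisions, updated with
    REINFORCE using a trajectory-level scalar reward."""

    def __init__(self, lr=0.5):
        self.logits = [0.0] * len(DECISIONS)
        self.lr = lr

    def sample(self):
        probs = softmax(self.logits)
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i
        return len(probs) - 1

    def update(self, action, reward):
        # Policy-gradient step: d log pi(a) / d logit_k = 1{k=a} - p_k
        probs = softmax(self.logits)
        for k in range(len(self.logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            self.logits[k] += self.lr * reward * grad

def trajectory_reward(decision_idx):
    # Assumed scenario: the trajectory realized from "decelerate" is safe
    # and scores well; all other decisions incur a small penalty.
    return 1.0 if DECISIONS[decision_idx] == "decelerate" else -0.2

random.seed(0)
policy = LinguisticPolicy()
for _ in range(500):
    a = policy.sample()
    policy.update(a, trajectory_reward(a))

best = DECISIONS[max(range(len(DECISIONS)), key=lambda i: policy.logits[i])]
print(best)
```

Because the exploration space has only five discrete options rather than a continuum of trajectories, the policy concentrates on the rewarded decision within a few hundred trials, which is the efficiency argument the abstract makes for reasoning-space (rather than action-space) exploration.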