Current Vision-Language-Action (VLA) paradigms in autonomous driving rely primarily on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning (RL) offers a promising pathway to address these issues through trial-and-error learning. However, applying online RL to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework built on a single large language model (LLM) equipped with two distinct sets of LoRA parameters. One set specializes the LLM as a Decision Expert for scenario reasoning and driving decision-making, while the other specializes it as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions rather than operating directly in a continuous action space. This design balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration during online RL. Using the lightweight Qwen-0.5B LLM, MindDrive achieves a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09% on the challenging Bench2Drive benchmark. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
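The core idea of feeding trajectory-level rewards back into a finite set of discrete linguistic decisions can be illustrated with a minimal sketch. This is not the paper's implementation: the decision vocabulary, the REINFORCE-style softmax update, and the stand-in reward function are all illustrative assumptions; in MindDrive the decisions would be produced by the LLM's Decision Expert and rewards would come from trajectories rolled out by the Action Expert in a driving simulator.

```python
# Hedged sketch: trial-and-error learning over a finite set of discrete
# linguistic driving decisions, with a trajectory-level reward fed back
# into the decision distribution via a REINFORCE-style softmax update.
# All names below are illustrative, not from the paper.
import math
import random

# Hypothetical discrete linguistic decision vocabulary.
DECISIONS = ["keep lane", "slow down", "change lane left", "change lane right", "stop"]

class DiscreteDecisionPolicy:
    def __init__(self, n_actions: int, lr: float = 0.1):
        self.logits = [0.0] * n_actions  # one logit per linguistic decision
        self.lr = lr

    def probs(self):
        # Numerically stable softmax over the decision logits.
        m = max(self.logits)
        exps = [math.exp(l - m) for l in self.logits]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self) -> int:
        # Sample a decision index from the current categorical distribution.
        r, cum = random.random(), 0.0
        for i, p in enumerate(self.probs()):
            cum += p
            if r < cum:
                return i
        return len(self.logits) - 1

    def update(self, action: int, reward: float):
        # REINFORCE gradient for a softmax policy:
        # d log pi(a) / d logit_i = 1{i == a} - p_i
        p = self.probs()
        for i in range(len(self.logits)):
            grad = (1.0 if i == action else 0.0) - p[i]
            self.logits[i] += self.lr * reward * grad

def trajectory_reward(action: int) -> float:
    # Stand-in for the simulator: pretend "slow down" (index 1)
    # yields the best trajectory in the current scenario.
    return 1.0 if action == 1 else 0.0

random.seed(0)
policy = DiscreteDecisionPolicy(len(DECISIONS))
for _ in range(500):
    a = policy.sample()
    policy.update(a, trajectory_reward(a))

best = max(range(len(DECISIONS)), key=lambda i: policy.probs()[i])
print(DECISIONS[best])  # the policy concentrates on the rewarded decision
```

Because the search space is a handful of discrete decisions rather than a continuous trajectory space, the policy concentrates on the rewarded behavior within a few hundred rollouts, which is the exploration-efficiency argument the abstract makes.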