Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection and (2) dense reasoning annotations. In this work, we address both challenges with \modelname (\textbf{No} \textbf{R}easoning for \textbf{D}riving). Compared to existing VLAs, \modelname achieves competitive performance while being fine-tuned on $<$60\% of the data and with no reasoning annotations, resulting in 3$\times$ fewer training tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias: GRPO disproportionately penalizes reward signals from scenarios that produce high-variance rollouts. \modelname overcomes this by incorporating Dr.~GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, \modelname achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous driving systems.
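To make the source of the difficulty bias concrete, compare the per-rollout advantage in the two algorithms (a minimal sketch in our own notation, following the original GRPO and Dr.~GRPO formulations rather than an equation from this paper). Given $G$ rollouts of the same scenario with rewards $r_1, \dots, r_G$, GRPO divides each centered reward by the group standard deviation, while Dr.~GRPO drops that normalization:
\[
\hat{A}_i^{\mathrm{GRPO}} = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)},
\qquad
\hat{A}_i^{\mathrm{Dr.\,GRPO}} = r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big).
\]
Because high-variance scenarios have a large $\operatorname{std}(\{r_j\})$, the GRPO form shrinks their advantages relative to low-variance scenarios; removing the denominator weights the reward signals of all scenarios equally.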