Vision-Language-Action (VLA) models offer a promising paradigm for autonomous driving by leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high action-generation latency, a consequence of autoregressive decoding, and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that passes the vision and reasoning guidance of the VLM to a flow-matching policy conditioned on historical trajectory initialization; the policy plans future trajectories and significantly reduces inference time. Second, to further improve performance and robustness, we propose a GRPO-based post-training method that enables the VLA model to learn not only from positive driving samples but also how to avoid typical negative behaviors and how to recover from them. We further introduce mReasoning, a new real-world driving reasoning dataset focused on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
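To illustrate why a flow-matching action expert can be faster than token-by-token decoding, the following is a minimal sketch of flow-matching sampling: a whole trajectory is refined in a handful of Euler steps along a velocity field. It is not the SpanVLA implementation. The names `history_init`, `make_toy_velocity`, and `euler_flow_sample` are hypothetical, the constant-velocity extrapolation with Gaussian noise is only an assumed reading of "historical trajectory initialization", and the toy velocity field stands in for a learned network that would also take VLM features as input.

```python
import random


def history_init(history, horizon, noise_std=0.1, rng=None):
    """Hypothetical initializer (an assumption, not the paper's method):
    extrapolate the last two history points at constant velocity and add
    Gaussian noise to each future waypoint."""
    rng = rng or random.Random(0)
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0
    return [
        [x1 + (i + 1) * dx + rng.gauss(0.0, noise_std),
         y1 + (i + 1) * dy + rng.gauss(0.0, noise_std)]
        for i in range(horizon)
    ]


def make_toy_velocity(target):
    """Stand-in for a learned action expert: a straight-line flow that
    points from the current sample toward a fixed target trajectory."""
    def velocity_fn(traj, t):
        scale = 1.0 / (1.0 - t)  # t < 1 for every Euler step below
        return [
            [(tx - x) * scale, (ty - y) * scale]
            for (x, y), (tx, ty) in zip(traj, target)
        ]
    return velocity_fn


def euler_flow_sample(velocity_fn, init_traj, num_steps=5):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler.
    All waypoints are updated together at every step."""
    traj = [list(p) for p in init_traj]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        vel = velocity_fn(traj, t)
        traj = [
            [x + dt * vx, y + dt * vy]
            for (x, y), (vx, vy) in zip(traj, vel)
        ]
    return traj


# Usage: refine a history-based initial guess into a 3-waypoint plan.
history = [(0.0, 0.0), (1.0, 0.0)]
target = [[2.0, 0.1], [3.0, 0.3], [4.0, 0.6]]
plan = euler_flow_sample(make_toy_velocity(target),
                         history_init(history, horizon=3),
                         num_steps=5)
```

In this sketch the cost is `num_steps` evaluations of the velocity function regardless of how many waypoints the plan contains, whereas autoregressive decoding needs one forward pass per generated token.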
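The GRPO-based post-training mentioned in the abstract builds on group-relative advantages: several candidate outputs are sampled for the same input, each is scored, and every score is normalized against its own group. The sketch below shows only that normalization step. The function name and the example reward values are hypothetical, and the use of the population standard deviation is a choice made here, not something the abstract specifies.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of
    its sampling group. Positive values mark samples that scored above
    the group average, negative values mark samples below it."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: hypothetical scores for four trajectories sampled in one scene.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean, a poorly scored trajectory receives a negative advantage and is pushed down during the policy update, which is one way a model can learn from negative samples as well as positive ones.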