A fundamental challenge in autonomous driving is integrating high-level semantic reasoning about long-tail events with low-level reactive control for robust driving. While large vision-language models (VLMs) trained on web-scale data offer powerful common-sense reasoning, they lack the grounded experience necessary for safe vehicle control. We posit that an effective autonomous agent should use the world knowledge of VLMs to guide a steerable driving policy toward robust control. To this end, we propose SteerVLA, which leverages the reasoning capabilities of VLMs to produce fine-grained language instructions that steer a vision-language-action (VLA) driving policy. Key to our method is this rich language interface between the high-level VLM and the low-level VLA, which allows the high-level policy to ground its reasoning more effectively in the control outputs of the low-level policy. To provide fine-grained language supervision aligned with vehicle control, we use a VLM to augment existing driving data with detailed language annotations, which we find essential for effective reasoning and steerability. We evaluate SteerVLA on a challenging closed-loop benchmark, where it outperforms state-of-the-art methods by 4.77 points in overall driving score and by 8.04 points on a long-tail subset. The project website is available at: https://steervla.github.io/.
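The hierarchy the abstract describes can be pictured as a two-rate loop: a slow, semantically rich VLM periodically emits a free-form language instruction, and a fast, language-conditioned VLA policy conditions on the latest instruction plus the current observation at every control step. The sketch below is purely illustrative; every class, field, and rule is a hypothetical stand-in, not the SteerVLA implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    steer: float   # normalized steering command in [-1, 1]
    accel: float   # normalized acceleration command in [-1, 1]

class HighLevelVLM:
    """Stand-in for a reasoning VLM that emits fine-grained instructions."""
    def instruct(self, observation: dict) -> str:
        # A real VLM would reason over camera frames and world knowledge;
        # here a toy rule substitutes for that reasoning.
        if observation.get("pedestrian_ahead"):
            return "slow down and yield to the pedestrian on the right"
        return "keep lane and maintain current speed"

class LowLevelVLA:
    """Stand-in for a language-conditioned, steerable driving policy."""
    def act(self, observation: dict, instruction: str) -> Action:
        # Toy grounding of the language instruction into control outputs.
        if "slow down" in instruction:
            return Action(steer=0.0, accel=-0.5)
        return Action(steer=0.0, accel=0.1)

def drive(observations, high_level_period: int = 5):
    """Two-rate loop: re-query the VLM every `high_level_period` steps,
    while the VLA produces an action at every step."""
    vlm, vla = HighLevelVLM(), LowLevelVLA()
    instruction, actions = "", []
    for t, obs in enumerate(observations):
        if t % high_level_period == 0:
            instruction = vlm.instruct(obs)        # slow semantic reasoning
        actions.append(vla.act(obs, instruction))  # fast reactive control
    return actions

# A pedestrian is visible for the first three steps, then clears.
acts = drive([{"pedestrian_ahead": t < 3} for t in range(6)])
```

Note how the instruction persists between VLM queries, so the low-level policy keeps braking at steps 3-4 even after the pedestrian clears, until the VLM is re-queried at step 5. The rich language interface is what lets the high-level reasoning shape the low-level control without the VLM running at control frequency.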