Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang

From arXiv. Survey; 47 pages, 7 figures, 9 tables; GitHub repo: https://github.com/worldbench/awesome-vla-for-ad

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some of these limitations by learning direct mappings from visual inputs to actions, but they remain opaque and sensitive to distribution shifts, and they lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
