The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents capable of manipulation and decision-making in complex, dynamic environments. This survey examines advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It analyzes VLA applications across diverse scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-learning-based, hybrid, and specialized methods, examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the survey further offers perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, it maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.