Embodied AI is widely recognized as a cornerstone of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models -- referred to as vision-language-action models (VLAs) -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.