Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Despite their remarkable performance, foundational VLAs are hindered by the prohibitive computational and data demands inherent to their large-scale architectures. While a surge of recent research has focused on enhancing VLA efficiency, the field lacks a unified framework to consolidate these advances. To fill this gap, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire model-training-data pipeline. Specifically, we introduce a unified taxonomy that systematically organizes the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, which covers efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track the latest developments: https://evla-survey.github.io/.