Foundation Vision-Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into VLMs, Vision-Language-Action models (VLAs) can be naturally formed and likewise show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs across multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial, since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes, leaving a gap in the systematic understanding of VLA design choices. In this work, we identify the key factors that significantly influence VLA performance and focus on answering three essential design questions: which backbone to select, how to formulate the VLA architecture, and when to add cross-embodiment data. The results firmly convince us why VLAs are needed and lead us to develop a new family of VLAs, RoboVLMs, which requires very few manual designs and achieves new state-of-the-art performance on three simulation tasks and in real-world experiments. Through our extensive experiments, which cover over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. Beyond this study, we release the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combination of various design choices, to facilitate future research. We open-source all details, including code, models, datasets, and toolkits, along with detailed training and evaluation recipes, at robovlms.github.io.