Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads. Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models. On LIBERO, a popular benchmark for evaluating VLAs, VLA-0 outperforms all existing methods trained on the same robotic data, including $\pi_{0.5}$-KI, OpenVLA-OFT, and SmolVLA. Furthermore, without large-scale robotics-specific training, it outperforms methods trained on large-scale robotic data, such as $\pi_{0.5}$-KI, $\pi_0$, GR00T-N1, and MolmoAct. These findings also translate to the real world, where VLA-0 outperforms SmolVLA, a VLA model pre-trained on large-scale real data. This paper summarizes our unexpected findings and spells out the specific techniques required to unlock the high performance of this simple yet potent VLA design. Visual results, code, and trained models are provided here: https://vla0.github.io/.