Vision-Language-Action models (VLAs) hold immense promise for enabling generalist robot manipulation. However, the best way to build them remains an open question. Current approaches often add complexity, such as modifying the existing vocabulary of a Vision-Language Model (VLM) with action tokens or introducing special action heads. Curiously, the simplest strategy of representing actions directly as text has remained largely unexplored. This work introduces VLA-0 to investigate this idea. We find that VLA-0 is not only effective; it is surprisingly powerful. With the right design, VLA-0 outperforms more involved models. On LIBERO, a popular benchmark for evaluating VLAs, VLA-0 outperforms all existing methods trained on the same robotic data, including $\pi_{0.5}$-KI, OpenVLA-OFT, and SmolVLA. Furthermore, without large-scale robotics-specific training, it outperforms methods trained on large-scale robotic data, such as $\pi_{0.5}$-KI, $\pi_0$, GR00T-N1, and MolmoAct. These findings also translate to the real world, where VLA-0 outperforms SmolVLA, a VLA model pre-trained on large-scale real data. This paper summarizes our unexpected findings and spells out the specific techniques required to unlock the high performance of this simple yet potent VLA design. Visual results, code, and trained models are provided here: https://vla0.github.io/.