An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making; this simplifies system design but limits the synergy between different information streams, making seamless autonomous reasoning, decision-making, and action execution difficult to achieve. To address these limitations, this paper introduces a novel paradigm, Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making, with the central aim of elevating the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information: the robot must accurately interpret detailed instructions and act on them in harmony with its visual observations. Consequently, we propose QUAdruped Robotic Transformer (QUART), a family of VLA models that takes visual information and instructions from diverse modalities as input and generates executable actions for real-world robots, and we present QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4,000 evaluation trials) shows that our approach yields performant robotic policies and enables QUART to exhibit a range of emergent capabilities.
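The abstract describes QUART as a model that maps visual observations and instructions to executable robot actions. The toy sketch below illustrates only that input/output contract; every name and field here (`QuadrupedCommand`, `ToyVLAPolicy`, the keyword rules) is a hypothetical stand-in, not the actual QUART architecture, which the paper describes as a transformer-based VLA model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuadrupedCommand:
    """Hypothetical low-level command a VLA-style quadruped policy might emit."""
    vx: float           # forward velocity (m/s)
    vy: float           # lateral velocity (m/s)
    yaw_rate: float     # turning rate (rad/s)
    body_height: float  # standing body height (m)

class ToyVLAPolicy:
    """Illustrative stand-in for a VLA model: (image, instruction) -> command.

    A real QUART model would run a vision-language transformer over the
    image and instruction tokens; trivial keyword rules are used here purely
    to show the interface, not the method.
    """

    def act(self, image: List[List[float]], instruction: str) -> QuadrupedCommand:
        text = instruction.lower()
        vx = 0.5 if "forward" in text else 0.0
        if "turn left" in text:
            yaw = 0.3
        elif "turn right" in text:
            yaw = -0.3
        else:
            yaw = 0.0
        return QuadrupedCommand(vx=vx, vy=0.0, yaw_rate=yaw, body_height=0.3)

policy = ToyVLAPolicy()
cmd = policy.act(image=[[0.0]], instruction="walk forward and turn left")
print(cmd.vx, cmd.yaw_rate)  # → 0.5 0.3
```

The point of the sketch is that perception (the image), language (the instruction), and action (the command) share a single forward pass, rather than being handled by separate perception, planning, and control modules.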