QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes it difficult to achieve seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a new paradigm, Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). The approach tightly integrates visual information and instructions to generate executable actions, merging perception, planning, and decision-making, with the aim of raising the overall intelligence of the robot. Within this framework, a notable challenge is aligning fine-grained instructions with visual perception information: the robot must interpret detailed instructions accurately and act on them consistently with what it observes. We therefore propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots, and we present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that the approach yields performant robotic policies and enables QUART to acquire a range of emergent capabilities.