QUAR-VLA: Vision-Language-Action Model for Quadruped Robots

An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes seamless autonomous reasoning, decision-making, and action execution difficult to achieve. To address these limitations, this paper introduces a new paradigm, Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). The approach tightly integrates visual information and instructions to generate executable actions, merging perception, planning, and decision-making, with the aim of raising the overall intelligence of the robot. Within this framework, a notable challenge is aligning fine-grained instructions with visual perception information: the robot must accurately interpret detailed instructions and act on them in harmony with its visual observations. We therefore propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots, and we present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.
