An important manifestation of robot intelligence is the ability to interact naturally and make decisions autonomously. Traditional approaches to robot control often compartmentalize perception, planning, and decision-making, which simplifies system design but limits the synergy between different information streams. This compartmentalization makes it difficult to achieve seamless autonomous reasoning, decision-making, and action execution. To address these limitations, this paper introduces a novel paradigm named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA). This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making, with the central aim of elevating the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information: the robot must accurately interpret and act upon detailed instructions in harmony with its visual observations. Consequently, we propose the QUAdruped Robotic Transformer (QUART), a family of VLA models that take visual information and instructions from diverse modalities as input and generate executable actions for real-world robots, and we present the QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset covering navigation, complex terrain locomotion, and whole-body manipulation tasks for training QUART models. Our extensive evaluation (4000 evaluation trials) shows that our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.