Autoregressive Action Sequence Learning for Robotic Manipulation

Designing a universal policy architecture that performs well across diverse robots and task configurations remains a key challenge. In this work, we address this challenge by representing robot actions as sequential data and generating them through autoregressive sequence modeling. Existing autoregressive architectures generate end-effector waypoints sequentially, like word tokens in language modeling, which limits them to low-frequency control tasks. Unlike language, robot actions are heterogeneous and often include continuous values (such as joint positions, 2D pixel coordinates, and end-effector poses) that are not well suited to language-based modeling. Based on this insight, we introduce a straightforward enhancement: our Chunking Causal Transformer (CCT) extends the single-token prediction of causal transformers to predict a variable number of tokens in a single step. This enhancement enables robust performance across diverse tasks with different control frequencies, improves efficiency by reducing the number of autoregression steps, and leads to a hybrid action sequence design that mixes different types of actions and uses a different chunk size for each action type. Based on CCT, we propose the Autoregressive Policy (ARP) architecture, which solves manipulation tasks by generating hybrid action sequences. We evaluate ARP across diverse robotic manipulation environments, including Push-T, ALOHA, and RLBench, and show that ARP, as a universal architecture, outperforms the environment-specific state of the art on all tested benchmarks while being more efficient in computation and parameter size. Videos of our real robot demonstrations, all source code, and the pretrained models of ARP can be found at http://github.com/mlzxy/arp.
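To make the chunking idea concrete, the following is a minimal sketch of chunked autoregressive decoding, not the ARP or CCT implementation. The names `generate_chunked` and `toy_predictor` and the chunk schedule are hypothetical, and the predictor is a fixed toy rule standing in for a learned model. It only illustrates how emitting several tokens per step reduces the number of autoregression steps, and how different chunk sizes can be mixed within one sequence.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface (an assumption, not the ARP API): given the tokens
# generated so far and a chunk size k, return the next k tokens in one call.
Predictor = Callable[[Sequence[float], int], List[float]]


def generate_chunked(
    predict_chunk: Predictor,
    prompt: Sequence[float],
    chunk_sizes: Sequence[int],
) -> Tuple[List[float], int]:
    """Generate tokens with one predictor call per chunk.

    Returns the newly generated tokens and the number of autoregression steps.
    """
    seq = list(prompt)
    steps = 0
    for k in chunk_sizes:
        chunk = predict_chunk(seq, k)  # predicts k tokens in a single step
        if len(chunk) != k:
            raise ValueError(f"expected {k} tokens, got {len(chunk)}")
        seq.extend(chunk)  # the whole chunk becomes context for the next step
        steps += 1
    return seq[len(prompt):], steps


def toy_predictor(context: Sequence[float], k: int) -> List[float]:
    # Toy rule, not a learned model: keep counting up from the last value.
    last = context[-1] if context else 0.0
    return [last + i + 1 for i in range(k)]


# Hybrid schedule (assumed for illustration): two tokens emitted one at a
# time, followed by two chunks of four tokens each.
tokens, steps = generate_chunked(toy_predictor, [0.0], [1, 1, 4, 4])
print(tokens)  # ten generated tokens
print(steps)   # four autoregression steps instead of ten
```

With a chunk size of 1 everywhere, the loop reduces to ordinary single-token autoregression, so the same routine covers both regimes.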
