In robotic visuomotor policy learning, diffusion-based models have achieved significant success in improving the accuracy of action trajectory generation compared to traditional autoregressive models. However, they suffer from inefficiency, owing to the multiple denoising steps required at inference, and from limited flexibility imposed by complex constraints. In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. CARP decouples action generation into two stages: first, an action autoencoder learns multi-scale representations of the entire action sequence; then, a GPT-style transformer refines the sequence prediction through a coarse-to-fine autoregressive process. This straightforward and intuitive approach produces highly accurate and smooth actions, matching or even surpassing the performance of diffusion-based policies while maintaining efficiency on par with autoregressive policies. We conduct extensive evaluations across diverse settings, including single-task and multi-task scenarios on state-based and image-based simulation benchmarks, as well as real-world tasks. CARP achieves competitive success rates, with up to a 10% improvement, and delivers 10x faster inference compared to state-of-the-art policies, establishing a high-performance, efficient, and flexible paradigm for action generation in robotic tasks.
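The coarse-to-fine, multi-scale idea behind the first stage can be sketched as a temporal residual pyramid over an action trajectory: each scale stores a downsampled view of what the coarser scales have not yet explained, and decoding sums the upsampled scales from coarsest to finest. This is only an illustrative sketch, not the paper's actual tokenizer (which quantizes these representations and pairs them with a learned transformer); the scale sizes, average-pooling choice, and function names here are assumptions for exposition.

```python
import numpy as np

def build_multiscale(actions, scales):
    """Encode a trajectory as residuals at increasing temporal scales.

    At each scale, the remaining residual is average-pooled to `s`
    timesteps, stored as that scale's representation, and the
    upsampled version is subtracted before moving to the next scale.
    """
    T, _ = actions.shape
    residual = actions.copy()
    tokens = []
    for s in scales:
        # segment boundaries for pooling T steps down to s steps
        idx = np.linspace(0, T, s + 1).astype(int)
        coarse = np.stack([residual[a:b].mean(axis=0)
                           for a, b in zip(idx[:-1], idx[1:])])
        tokens.append(coarse)
        # upsample back to T and remove what this scale explains
        residual = residual - np.repeat(coarse, np.diff(idx), axis=0)
    return tokens

def decode_coarse_to_fine(tokens, T):
    """Rebuild the trajectory by summing upsampled scales, coarse first."""
    out = np.zeros((T, tokens[0].shape[1]))
    for coarse in tokens:
        idx = np.linspace(0, T, coarse.shape[0] + 1).astype(int)
        out += np.repeat(coarse, np.diff(idx), axis=0)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    traj = rng.normal(size=(16, 7))   # 16 timesteps of a 7-DoF action
    scales = [1, 2, 4, 16]            # coarse -> fine (assumed sizes)
    toks = build_multiscale(traj, scales)
    recon = decode_coarse_to_fine(toks, 16)
    # reconstruction is exact (up to float error) when the finest
    # scale equals the trajectory length
    print(np.max(np.abs(recon - traj)))
```

In the full method, a GPT-style transformer would predict the scale-`s` tokens conditioned on all coarser scales, so generation proceeds from a rough outline of the whole trajectory to its fine details rather than step by step in time.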