We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict the next action(s) forward in time, CoA generates an entire trajectory through explicit backward reasoning from task-specific goals, using an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, so that each local action is tightly constrained by the final goal. To realize this action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA exhibits strong spatial generalization while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, CoA achieves state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.
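To make the backward generation order concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The learned decoder is replaced by a hypothetical toy function (`toy_step_model`), and the dynamic stopping rule is assumed to be a distance threshold to the current pose (`stop_tol`); the abstract specifies neither. Reverse temporal ensemble and multi-token prediction are omitted.

```python
import math
from typing import Callable, List, Sequence

Action = List[float]  # one continuous action token, e.g. an end-effector position


def toy_step_model(history: Sequence[Action], start: Action) -> Action:
    """Hypothetical stand-in for the learned autoregressive decoder.

    It moves halfway from the most recently generated action toward the
    current pose. A real policy would condition on visual observations.
    """
    last = history[-1]
    return [a + 0.5 * (s - a) for a, s in zip(last, start)]


def generate_backward(
    keyframe: Action,
    start: Action,
    step_model: Callable[[Sequence[Action], Action], Action],
    stop_tol: float = 0.05,   # assumed stopping threshold
    max_len: int = 64,        # safety cap on trajectory length
) -> List[Action]:
    """Generate a chain of actions from the goal keyframe back toward `start`."""
    chain: List[Action] = [list(keyframe)]  # first token: the keyframe action
    while len(chain) < max_len:
        nxt = step_model(chain, start)      # conditioned on keyframe + history
        chain.append(nxt)
        # Assumed dynamic stopping: end once the chain reaches the current pose.
        if math.dist(nxt, start) < stop_tol:
            break
    return chain


def to_execution_order(chain: List[Action]) -> List[Action]:
    """Reverse the generated chain so it runs from the current pose to the goal."""
    return chain[::-1]


if __name__ == "__main__":
    chain = generate_backward([1.0, 0.0], [0.0, 0.0], toy_step_model)
    for action in to_execution_order(chain):
        print(action)
```

Because every token after the first is conditioned on the keyframe, the goal anchors the whole chain, and the trajectory length is determined by the stopping rule rather than fixed in advance.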