Vision-Language-Action models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally translating multimodal inputs directly into actions via Vision-Language Model embeddings. Recent advancements have introduced explicit intermediate reasoning, such as sub-task prediction (language) or goal image synthesis (vision), to guide action generation. However, these forms of intermediate reasoning are often indirect and inherently limited in their capacity to convey the full, granular information required for precise action execution. Instead, we posit that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm in which the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that realizes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from internal representations of the multimodal input. Together they form an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulation environments demonstrate the superiority of our proposed method. Code is available at: https://github.com/AgibotTech/ACoT-VLA.
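The data flow described above (EAR proposes a coarse reference trajectory, IAR extracts a latent action prior, and both condition the action head) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the linear interpolation standing in for the EAR, the mean pooling standing in for the IAR, and the additive conditioning in the action head are hypothetical and are not taken from the ACoT-VLA code base, where these components are learned modules.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ACoT:
    """Hypothetical container for the two reasoning signals."""
    reference_trajectory: List[Vector]  # explicit, coarse action intents
    latent_prior: Vector                # implicit action prior


def explicit_action_reasoner(start: Vector, goal: Vector, num_waypoints: int) -> List[Vector]:
    """Toy stand-in for the EAR: a straight-line coarse trajectory.

    Assumes num_waypoints >= 2. The real EAR is a learned module.
    """
    steps = num_waypoints - 1
    return [
        [s + (g - s) * i / steps for s, g in zip(start, goal)]
        for i in range(num_waypoints)
    ]


def implicit_action_reasoner(features: List[Vector], action_dim: int) -> Vector:
    """Toy stand-in for the IAR: mean-pool feature vectors, keep action_dim entries."""
    pooled = [sum(column) / len(features) for column in zip(*features)]
    return pooled[:action_dim]


def action_head(acot: ACoT, refine_steps: int) -> List[Vector]:
    """Toy action head: densify the coarse trajectory and add the latent prior as an offset."""
    trajectory = acot.reference_trajectory
    actions = []
    for current, following in zip(trajectory, trajectory[1:]):
        for k in range(refine_steps):
            t = k / refine_steps
            actions.append([
                x + (y - x) * t + p
                for x, y, p in zip(current, following, acot.latent_prior)
            ])
    # Close the sequence with the final waypoint.
    actions.append([y + p for y, p in zip(trajectory[-1], acot.latent_prior)])
    return actions


# Example wiring: coarse plan first, then fine-grained actions conditioned on it.
coarse = explicit_action_reasoner([0.0, 0.0], [1.0, 2.0], num_waypoints=3)
prior = implicit_action_reasoner([[0.0, 0.0, 0.0]], action_dim=2)
fine_actions = action_head(ACoT(coarse, prior), refine_steps=2)
```

The point of the sketch is only the ordering: the action head never sees the observation alone, but always together with an action-space plan and an action-space prior.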