Vision-Language-Action (VLA) models have emerged as essential generalist robot policies for diverse manipulation tasks, conventionally relying on directly translating multimodal inputs into actions via Vision-Language Model (VLM) embeddings. Recent advances have introduced explicit intermediate reasoning, such as sub-task prediction (language) or goal-image synthesis (vision), to guide action generation. However, such intermediate reasoning is often indirect and inherently limited in its capacity to convey the full, granular information required for precise action execution. We posit instead that the most effective form of reasoning is one that deliberates directly in the action space. We introduce Action Chain-of-Thought (ACoT), a paradigm in which the reasoning process itself is formulated as a structured sequence of coarse action intents that guide the final policy. In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. Specifically, we introduce two complementary components: an Explicit Action Reasoner (EAR) and an Implicit Action Reasoner (IAR). The former proposes coarse reference trajectories as explicit action-level reasoning steps, while the latter extracts latent action priors from the internal representations of the multimodal input; together they form an ACoT that conditions the downstream action head to enable grounded policy learning. Extensive experiments in real-world and simulated environments demonstrate the superiority of the proposed method, which achieves success rates of 98.5%, 84.1%, and 47.4% on LIBERO, LIBERO-Plus, and VLABench, respectively.
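To make the conditioning pattern concrete, the following is a minimal PyTorch sketch of the two-reasoner design described above: an EAR-like module that proposes a coarse reference trajectory, an IAR-like module that distills a latent action prior, and an action head conditioned on both. All module names, dimensions, and the simple MLP internals are hypothetical illustrations under our own assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the ACoT-VLA conditioning pattern (not the
# authors' code): EAR -> coarse trajectory, IAR -> latent prior, both
# concatenated with the fused VLM embedding to condition the action head.
import torch
import torch.nn as nn

class ExplicitActionReasoner(nn.Module):
    """Assumed EAR: maps a fused VLM feature to a coarse reference
    trajectory of `horizon` waypoints in the action space."""
    def __init__(self, feat_dim=512, action_dim=7, horizon=8):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.GELU(),
            nn.Linear(256, horizon * action_dim),
        )

    def forward(self, vlm_feat):                     # (B, feat_dim)
        traj = self.head(vlm_feat)                   # (B, horizon * action_dim)
        return traj.view(-1, self.horizon, self.action_dim)

class ImplicitActionReasoner(nn.Module):
    """Assumed IAR: projects the VLM's internal representation into a
    latent action prior."""
    def __init__(self, feat_dim=512, latent_dim=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, vlm_feat):
        return self.proj(vlm_feat)                   # (B, latent_dim)

class ACoTActionHead(nn.Module):
    """Action head conditioned on the ACoT (explicit trajectory plus
    latent prior); emits the final fine-grained action."""
    def __init__(self, feat_dim=512, action_dim=7, horizon=8, latent_dim=64):
        super().__init__()
        cond_dim = feat_dim + horizon * action_dim + latent_dim
        self.net = nn.Sequential(
            nn.Linear(cond_dim, 256), nn.GELU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, vlm_feat, ref_traj, latent_prior):
        cond = torch.cat([vlm_feat, ref_traj.flatten(1), latent_prior], dim=-1)
        return self.net(cond)                        # (B, action_dim)

# Usage: one forward pass with a dummy fused multimodal embedding.
vlm_feat = torch.randn(4, 512)
ear, iar, head = ExplicitActionReasoner(), ImplicitActionReasoner(), ACoTActionHead()
action = head(vlm_feat, ear(vlm_feat), iar(vlm_feat))  # shape: (4, 7)
```

The design choice the sketch highlights is that both reasoning signals live in (or near) the action space, so the action head receives direct, granular guidance rather than an indirect linguistic or visual intermediary.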