HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but they often struggle in long-horizon or out-of-distribution scenarios because they lack explicit mechanisms for multimodal reasoning and for anticipating how the world will evolve under action. Recent work adds textual chain-of-thought reasoning or visual subgoal prediction to VLA models, but still does not offer a unified, human-like framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO to learn at scale, we introduce an automated pipeline that synthesizes EM-CoT training data, along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing the baseline policy π_0 by 34.1% on the RoboTwin benchmark; (2) all proposed components of the training recipe and the EM-CoT design help improve task success rate; and (3) with the proposed EM-CoT reasoning, HALO exhibits strong generalization under aggressive, unseen environmental randomization.
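To make the sequential EM-CoT process concrete, the following is a minimal sketch of the data flow only: textual reasoning first, then a visual subgoal conditioned on that reasoning, then actions conditioned on both. All names here (`EMCoTStep`, `em_cot_step`, and the three toy expert functions) are hypothetical and introduced for illustration. The abstract does not specify interfaces, and the actual model couples its experts inside a Mixture-of-Transformers rather than calling separate functions.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical container for one EM-CoT step (not from the paper).
@dataclass
class EMCoTStep:
    reasoning: str               # textual task reasoning
    subgoal: List[float]         # stand-in for a predicted subgoal image
    actions: List[List[float]]   # predicted chunk of actions


def em_cot_step(
    observation: List[float],
    instruction: str,
    reason_fn: Callable[[List[float], str], str],
    foresee_fn: Callable[[List[float], str, str], List[float]],
    act_fn: Callable[[List[float], str, str, List[float]], List[List[float]]],
) -> EMCoTStep:
    """Run the three stages in the order the abstract describes."""
    # Stage 1: textual task reasoning from the observation and instruction.
    reasoning = reason_fn(observation, instruction)
    # Stage 2: visual subgoal prediction, conditioned on the reasoning.
    subgoal = foresee_fn(observation, instruction, reasoning)
    # Stage 3: action prediction, conditioned on reasoning and subgoal.
    actions = act_fn(observation, instruction, reasoning, subgoal)
    return EMCoTStep(reasoning, subgoal, actions)


# Toy stand-ins for the experts (assumptions, for demonstration only).
def toy_reason(obs: List[float], instr: str) -> str:
    return f"Plan: {instr}"


def toy_foresee(obs: List[float], instr: str, reasoning: str) -> List[float]:
    # Pretend the subgoal is the observation shifted by one unit.
    return [x + 1.0 for x in obs]


def toy_act(
    obs: List[float], instr: str, reasoning: str, subgoal: List[float]
) -> List[List[float]]:
    # A single action that moves from the observation toward the subgoal.
    return [[g - o for g, o in zip(subgoal, obs)]]


step = em_cot_step([0.0, 1.0], "pick up the cup", toy_reason, toy_foresee, toy_act)
print(step.reasoning, step.subgoal, step.actions)
```

The point of the sketch is the ordering of dependencies: each later stage receives the outputs of the earlier ones, which is what distinguishes EM-CoT-augmented action prediction from predicting actions directly from the observation and instruction.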
