Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities and lack synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories and mitigates hallucinations by injecting domain knowledge. We release the resulting dataset, \textbf{OptimusM$^{4}$}, to the community. Second, to reconcile the divergent computational requirements of the two systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational ``Fast Path'' for System 1 and a ``Deep Path'' for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System~2 tasks (improvements of 21\% on Planning, 66\% on Captioning, 76\% on Embodied QA, 3.4$\times$ on Grounding, and 18\% on Reflection) and System~1 tasks (3\% on Long-Horizon Action), and achieves a notable 60\% success rate on open-ended tasks.
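
To make the dual-router idea concrete, the following is a minimal toy sketch, not the Optimus-3 implementation. It assumes that the Task Router picks one dedicated expert per task inside each layer and that the Layer Router decides, per layer, whether the layer is executed at all. The names `Layer`, `in_fast_path`, `task_router`, `layer_router`, and `forward` are hypothetical, and both routers are hard-coded rules here, whereas a real model would presumably learn them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


@dataclass
class Layer:
    # One expert per task: a hypothetical stand-in for decoupled, task-specific parameters.
    experts: Dict[str, Expert]
    # Hypothetical flag: whether this layer belongs to the shallow "Fast Path".
    in_fast_path: bool


def task_router(task: str, layer: Layer) -> Expert:
    """Select the expert dedicated to this task, so tasks do not share expert parameters."""
    return layer.experts[task]


def layer_router(system: str, layer: Layer) -> bool:
    """Decide whether to execute a layer (assumed rule, not taken from the paper).

    System 2 runs every layer ("Deep Path"); System 1 runs only the
    layers flagged as part of the "Fast Path".
    """
    return system == "system2" or layer.in_fast_path


def forward(x: Vector, task: str, system: str, layers: List[Layer]) -> Tuple[Vector, int]:
    """Run the input through the routed layers; return the output and the number of layers executed."""
    executed = 0
    for layer in layers:
        if not layer_router(system, layer):
            continue  # skipped layer: the input passes through unchanged
        x = task_router(task, layer)(x)
        executed += 1
    return x, executed


def make_layer(in_fast_path: bool) -> Layer:
    """Build a toy layer with two arithmetic 'experts' in place of neural sub-networks."""
    return Layer(
        experts={
            "action": lambda v: [a + 1.0 for a in v],    # toy System 1 expert
            "planning": lambda v: [a * 2.0 for a in v],  # toy System 2 expert
        },
        in_fast_path=in_fast_path,
    )


# Three layers; the middle one is excluded from the Fast Path.
layers = [make_layer(True), make_layer(False), make_layer(True)]

print(forward([0.0], "action", "system1", layers))    # two of three layers executed
print(forward([1.0], "planning", "system2", layers))  # all three layers executed
```

In this toy setting, a System 1 action query touches only two layers and only the `action` experts, while a System 2 planning query traverses all three layers through the `planning` experts. This is meant only to illustrate how parameter decoupling and depth modulation can be combined in one forward pass.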
