World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as an action prior and lack adaptive multimodal reasoning, which limits their effectiveness on complex, long-horizon tasks. We observe that WAMs require different multimodal reasoning modes in different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose \textbf{AdaWAM}, a world action model with adaptive multimodal reasoning. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at https://adawam.github.io/.