MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

World Action Models (WAMs) couple a video dynamics prior with the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands; this places the upper and lower body in inconsistent action spaces and reduces the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.
