World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally take one of two approaches: "Imagine-then-Execute", which predicts video and then infers actions from it via inverse dynamics, and "Joint Modeling", which models actions and video representations jointly. Based on systematic experiments, we observe a fundamental trade-off between the two: the former explicitly leverages the world model for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that uses a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio-temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from the predicted visual evolution. To coordinate them adaptively, we propose a Process-Adaptive Gating Mechanism that automatically determines when and where to switch between the two experts. As a result, across the stages of a task, the world model drives the reactive expert to expand the exploration space and the predictive expert to perform precise interactions. For evaluation, we construct three test environments unseen during training across six real-world robotic tasks, covering variations in background, position, and object semantics. HarmoWAM achieves strong zero-shot generalization across these scenarios, outperforming prior state-of-the-art vision-language-action (VLA) models and WAMs by margins of 33% and 29%, respectively.
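The control flow described above (one world model conditioning two experts, with a gate choosing between them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the HarmoWAM implementation: the names `WorldModelOutput`, `gated_action`, `toy_gate`, `toy_predictive_expert`, and `toy_reactive_expert` are hypothetical, the gate is a hard threshold on a scalar score, and the experts are toy stand-ins for learned networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class WorldModelOutput:
    """Hypothetical container for what the world model hands to the experts."""
    latent: Vector                  # stand-in for latent dynamics features
    predicted_frames: List[Vector]  # stand-in for the predicted visual evolution


def toy_gate(wm_out: WorldModelOutput) -> float:
    # Assumption: the first latent entry encodes how close the task is to an
    # interaction stage, clamped to [0, 1]. A real gate would be learned.
    return min(1.0, max(0.0, wm_out.latent[0]))


def toy_predictive_expert(latent: Vector) -> Vector:
    # Stand-in for iterative action generation from latent dynamics.
    return [0.5 * x for x in latent]


def toy_reactive_expert(frames: List[Vector]) -> Vector:
    # Stand-in for inverse dynamics: the action is read off the change
    # between the first and last predicted frames.
    first, last = frames[0], frames[-1]
    return [b - a for a, b in zip(first, last)]


def gated_action(
    wm_out: WorldModelOutput,
    gate: Callable[[WorldModelOutput], float] = toy_gate,
    threshold: float = 0.5,
) -> Tuple[str, Vector]:
    """Pick one expert per step: reactive for transit, predictive for interaction."""
    if gate(wm_out) >= threshold:
        return "predictive", toy_predictive_expert(wm_out.latent)
    return "reactive", toy_reactive_expert(wm_out.predicted_frames)


if __name__ == "__main__":
    frames = [[0.0, 0.0], [1.0, 2.0]]
    print(gated_action(WorldModelOutput([0.25, 0.25], frames)))  # transit stage
    print(gated_action(WorldModelOutput([0.75, 0.25], frames)))  # interaction stage
```

The hard switch keeps the sketch short; a soft blend of the two experts' outputs would be an equally plausible reading of "gating", and the abstract does not say which is used.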