HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

World Action Models (WAMs) have emerged as a promising paradigm for robot control by modeling physical dynamics. Current WAMs generally follow one of two paradigms: the "Imagine-then-Execute" approach, which predicts future video and then infers actions from it via inverse dynamics, and the "Joint Modeling" approach, which jointly models actions and video representations. Based on systematic experiments, we observe a fundamental trade-off between these paradigms: the former explicitly leverages the world model for generalizable transit but lacks interaction precision, whereas the latter enables fine-grained, temporally coherent action generation but is constrained by the exploration space of the training distribution. Motivated by these findings, we propose HarmoWAM, an end-to-end WAM that fully leverages a world model to unify predictive and reactive control, enabling both generalizable transit and precise manipulation. Specifically, the world model provides spatio-temporal physical priors that condition two complementary action experts: a predictive expert that leverages latent dynamics for iterative action generation, and a reactive expert that directly infers actions from the predicted visual evolution. To coordinate the two experts adaptively, we propose a Process-Adaptive Gating Mechanism that automatically determines when and where to switch between them. This allows the world model to drive the reactive expert to expand the exploration space and the predictive expert to perform precise interactions across different stages of a task. For evaluation, we construct three test environments unseen during training across six real-world robotic tasks, covering variations in background, position, and object semantics. HarmoWAM achieves strong zero-shot generalization across these scenarios, outperforming prior state-of-the-art VLA models and WAMs by margins of 33% and 29%, respectively.
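The abstract does not specify how the Process-Adaptive Gating Mechanism is implemented, so the following is only a toy sketch of the general idea of routing between two action experts conditioned on a shared world-model prior. Every name here (`GatedController`, `toy_gate`, the distance-based prior, the 0.05 cutoff, the 0.5 threshold) is a hypothetical placeholder rather than part of HarmoWAM; the actual gate is presumably learned rather than hand-written.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Action = List[float]
Prior = Sequence[float]  # stand-in for world-model features (assumption)


@dataclass
class GatedController:
    """Toy router between two experts; not the paper's architecture."""
    reactive_expert: Callable[[Prior], Action]
    predictive_expert: Callable[[Prior], Action]
    gate: Callable[[Prior], float]  # hypothetical score in [0, 1]
    threshold: float = 0.5          # hypothetical switching threshold

    def select(self, prior: Prior) -> str:
        # A high gate score means "precise interaction stage" in this sketch.
        return "predictive" if self.gate(prior) >= self.threshold else "reactive"

    def act(self, prior: Prior) -> Tuple[str, Action]:
        name = self.select(prior)
        expert = self.predictive_expert if name == "predictive" else self.reactive_expert
        return name, expert(prior)


def toy_gate(prior: Prior) -> float:
    # prior[0]: hypothetical gripper-to-object distance in metres.
    return 1.0 if prior[0] < 0.05 else 0.0


def toy_reactive(prior: Prior) -> Action:
    # Coarse transit step straight toward the object (illustrative only).
    return [prior[0], 0.0]


def toy_predictive(prior: Prior) -> Action:
    # Small, careful step for the interaction stage (illustrative only).
    return [prior[0] * 0.1, 0.0]


controller = GatedController(toy_reactive, toy_predictive, toy_gate)
trace = [controller.act([d])[0] for d in (0.30, 0.10, 0.02)]
print(trace)  # ['reactive', 'reactive', 'predictive']
```

In this sketch the reactive expert handles the transit stage and the predictive expert takes over once the gate score crosses the threshold, mirroring the division of labor described above.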

