OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.
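The address/content split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes identity projections, a single attention head, a query supplied from outside (for example, derived from the instruction), and content vectors used as values; the abstract specifies none of these. The helper names `address_keyed_attention` and `reset_addresses` are hypothetical. The sketch shows only the two properties the abstract names: keys built from addresses alone, and the address slice overwritten after a layer.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def address_keyed_attention(query, addresses, contents):
    # Keys are the address vectors only, so the attention weights cannot
    # depend on what a slot currently contains.
    # Assumption: values are the raw content vectors (no learned projection).
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, a) / scale for a in addresses])
    dim = len(contents[0])
    readout = [sum(w * c[i] for w, c in zip(weights, contents)) for i in range(dim)]
    return weights, readout


def reset_addresses(slots, addresses):
    # Each slot is laid out as [address slice | content slice].
    # Overwrite the address slice with the persistent address and keep
    # whatever the layer wrote into the content slice.
    return [list(a) + list(s[len(a):]) for s, a in zip(slots, addresses)]


# Toy scene: one robot slot and two object slots (N = 2).
addresses = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
contents = [[0.1, 0.2], [5.0, 5.0], [-3.0, 1.0]]
query = [0.0, 8.0, 0.0]  # hypothetical query that points at slot 1

weights_a, readout_a = address_keyed_attention(query, addresses, contents)

# Swap the contents of the two object slots; the addresses stay put.
swapped = [contents[0], contents[2], contents[1]]
weights_b, readout_b = address_keyed_attention(query, addresses, swapped)

# Simulate a layer that drifts every slot state, then reset the addresses.
drifted = [[x + 0.5 for x in a + c] for a, c in zip(addresses, contents)]
restored = reset_addresses(drifted, addresses)
```

In this toy setting, `weights_a` and `weights_b` are identical because the keys never see the content, while `readout_a` and `readout_b` differ because the values do. That is the sense in which routing selects which slot to read independently of what the slot currently holds.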

