World Action Models (WAMs) enhance Vision-Language-Action policies by jointly predicting scene evolution and robot actions, but existing methods usually represent the predicted world as holistic images, video tokens, or global latents. These representations are difficult for an action decoder to address when an instruction refers to a particular object, especially under scene shifts where object identity is entangled with context. We propose OA-WAM, an Object-Addressable World Action Model for robust robot manipulation. OA-WAM decomposes each frame into N+1 slot states, with one robot slot and N object slots. Each slot contains a persistent address vector and a time-varying content vector, and is fused with text, image, proprioception, and past-action tokens in a block-causal sequence. A world head predicts next-frame slot states, while a flow-matching action head decodes a 16-step continuous action chunk in the same forward pass. Addressability is enforced by routing cross-slot attention through address-only keys and resetting the address slice at every transformer layer, separating which object to act on from what that object currently is without adding extra tokens. OA-WAM matches strong VLA and WAM baselines on LIBERO (97.8%) and SimplerEnv (79.3%), reaches state-of-the-art performance on the most relevant LIBERO-Plus geometric axes, and remains competitive on the seven-axis aggregate. A causal slot-intervention test yields a swap-binding cosine of 0.87, versus at most 0.09 for holistic baselines. These results suggest that addressable object states provide an effective interface for robust world-action modeling under scene perturbations.
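The addressing mechanism described in the abstract (cross-slot attention routed through address-only keys, with the address slice reset at every layer) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `address_keys` and `address_routed_layer`, the weight matrices, the residual update, and the choice to place the address in the first `d_addr` dimensions of each slot are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def address_keys(slots, w_k, d_addr):
    # Hypothetical helper: keys are projected from the address slice only,
    # so they cannot depend on the time-varying content.
    return slots[:, :d_addr] @ w_k


def address_routed_layer(slots, addr, w_q, w_k, w_v, d_addr):
    """One assumed attention layer over N+1 slots.

    slots: (S, d_addr + d_content) current slot states
    addr:  (S, d_addr) persistent address vectors
    """
    q = slots @ w_q                          # queries see the full slot state
    k = address_keys(slots, w_k, d_addr)     # keys see the address only
    v = slots @ w_v                          # values carry address and content
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = slots + weights @ v                # residual update (an assumption)
    out[:, :d_addr] = addr                   # reset the address slice
    return out, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_slots, d_addr, d_content, d_k = 4, 3, 5, 6   # 1 robot slot + 3 object slots
    d = d_addr + d_content
    addr = rng.normal(size=(n_slots, d_addr))
    slots = np.concatenate([addr, rng.normal(size=(n_slots, d_content))], axis=1)
    w_q = rng.normal(size=(d, d_k))
    w_k = rng.normal(size=(d_addr, d_k))
    w_v = rng.normal(size=(d, d))

    for _ in range(3):                       # stack a few layers
        slots, _ = address_routed_layer(slots, addr, w_q, w_k, w_v, d_addr)

    # The address survives every layer unchanged, while content evolves.
    print(np.allclose(slots[:, :d_addr], addr))
```

Under these assumptions, a slot's key depends only on its address, so changing what an object currently looks like does not change how other tokens address it, and the per-layer reset keeps the address from drifting as content is mixed in.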