World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.