World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can cause overfitting to task-irrelevant factors such as dynamic backgrounds and illumination changes. These distractors reduce the model's ability to generalize and ultimately lead to unreliable, fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. On the LIBERO and RLBench simulation benchmarks, MWM significantly outperforms state-of-the-art RGB-based world models. Furthermore, real-world experiments and a robustness evaluation based on random token pruning show that MWM generalizes better and remains resilient to the loss of texture information.
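As a rough illustration of the robustness evaluation mentioned above, random token pruning can be sketched as dropping a random subset of input tokens before they reach the model. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function name `random_token_prune`, the `keep_ratio` parameter, and the choice to preserve the original token order are hypothetical and may differ from what MWM actually does.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_token_prune(
    tokens: Sequence[T], keep_ratio: float, seed: int = 0
) -> Tuple[List[T], List[int]]:
    """Keep a random subset of tokens (hypothetical helper, not the paper's code).

    Returns the surviving tokens in their original order, together with
    the indices that were kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_tokens = len(tokens)
    # Keep at least one token whenever the input is non-empty.
    n_keep = max(1, round(n_tokens * keep_ratio)) if n_tokens > 0 else 0

    # A local generator keeps the pruning reproducible for a given seed.
    rng = random.Random(seed)
    kept_idx = sorted(rng.sample(range(n_tokens), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


if __name__ == "__main__":
    # Stand-in for a flattened grid of 16 visual tokens.
    patch_tokens = [f"tok_{i}" for i in range(16)]
    kept, idx = random_token_prune(patch_tokens, keep_ratio=0.25, seed=1)
    print(len(kept), idx)
```

Under this sketch, sweeping `keep_ratio` downward and recording task success at each setting would give one way to measure how gracefully a policy degrades as visual information is removed.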