World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can cause overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize and ultimately lead to unreliable, fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations on the LIBERO and RLBench simulation benchmarks show that MWM significantly outperforms state-of-the-art RGB-based world models. Furthermore, real-world experiments and a robustness evaluation based on random token pruning show that MWM generalizes better and is resilient to the loss of texture information.
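The robustness evaluation mentioned above relies on random token pruning. The abstract does not specify the procedure, so the following is only a minimal sketch of the general idea, assuming that a fixed fraction of input tokens is kept uniformly at random and the rest are discarded before the model sees them. The function name `prune_tokens`, the `keep_ratio` parameter, and the toy token list are hypothetical and are not taken from the paper.

```python
import random


def prune_tokens(tokens, keep_ratio, seed=None):
    """Randomly keep a fraction of tokens, preserving their original order.

    Illustrative only: the sampling scheme (uniform, without replacement)
    is an assumption, not the paper's documented protocol.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Keep at least one token so the downstream model always has input.
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


# Hypothetical input: 16 patch tokens, each a small feature vector.
tokens = [[float(i), float(i) * 0.5] for i in range(16)]
kept, idx = prune_tokens(tokens, keep_ratio=0.25, seed=0)
```

Sweeping `keep_ratio` from 1.0 downward and recording task success at each level would give one way to measure how gracefully a policy degrades as input information is removed.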