Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but they overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach combines motion-aware features from a latent world model with fine-grained pixel-space features, enabling MoWM to emphasize action-relevant visual details during action decoding. Extensive evaluations on the CALVIN benchmark and real-world manipulation tasks demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
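To make the fusion idea concrete, below is a minimal numpy sketch of combining a motion-aware latent feature with a pixel-space feature via a per-dimension gate before a linear action head. All names, dimensions, and the gating scheme are illustrative assumptions for this sketch, not the actual MoWM architecture; in the real system these projections would be learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical feature and action dimensions (not from the paper).
D_LATENT, D_PIXEL, D_FUSED, D_ACTION = 32, 64, 48, 7

# Hypothetical projection weights; a trained model would learn these.
W_lat = rng.standard_normal((D_LATENT, D_FUSED)) * 0.1
W_pix = rng.standard_normal((D_PIXEL, D_FUSED)) * 0.1
W_gate = rng.standard_normal((2 * D_FUSED, D_FUSED)) * 0.1
W_act = rng.standard_normal((D_FUSED, D_ACTION)) * 0.1

def fuse_and_decode(latent_feat, pixel_feat):
    """Gated fusion of a motion-aware latent feature and a pixel-space
    feature, followed by a linear action head (illustrative only)."""
    z = latent_feat @ W_lat                        # project latent stream
    p = pixel_feat @ W_pix                         # project pixel stream
    g = sigmoid(np.concatenate([z, p]) @ W_gate)   # per-dimension mixing gate
    fused = g * z + (1.0 - g) * p                  # convex blend of streams
    return fused @ W_act                           # decode an action vector

action = fuse_and_decode(rng.standard_normal(D_LATENT),
                         rng.standard_normal(D_PIXEL))
print(action.shape)  # (7,)
```

The gate lets the decoder lean on compact motion cues by default while recovering fine-grained pixel detail wherever it is action-relevant, which mirrors the complementarity of the two feature spaces described above.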