What if a video generation model could not only imagine a plausible future, but also the correct one, accurately reflecting how the world changes with each action? We address this question by presenting the Egocentric World Model (EgoWM), a simple, architecture-agnostic method that transforms any pretrained video diffusion model into an action-conditioned world model, enabling controllable future prediction. Rather than training from scratch, we repurpose the rich world priors of Internet-scale video models and inject motor commands through lightweight conditioning layers. This allows the model to follow actions faithfully while preserving realism and strong generalization. Our approach scales naturally across embodiments and action spaces, ranging from 3-DoF mobile robots to 25-DoF humanoids, where predicting egocentric, joint-angle-driven dynamics is substantially more challenging. The model produces coherent rollouts for both navigation and manipulation tasks while requiring only modest fine-tuning. To evaluate physical correctness independently of visual appearance, we introduce the Structural Consistency Score (SCS), which measures whether stable scene elements evolve consistently with the provided actions. EgoWM improves SCS by up to 80% over prior state-of-the-art navigation world models, while achieving up to six times lower inference latency and robust generalization to unseen environments, including navigation inside paintings.
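The abstract mentions injecting motor commands through lightweight conditioning layers on top of a frozen pretrained backbone. As a minimal illustration of one common way to do this, the sketch below applies FiLM-style feature modulation, where an action vector is projected into per-channel scale and shift terms; this is an assumption for illustration only, and the function name `film_condition` and all shapes are hypothetical, not EgoWM's actual architecture.

```python
import numpy as np

# Hypothetical sketch: FiLM-style action conditioning of frozen features.
# This is an assumed mechanism, not EgoWM's documented implementation.

def film_condition(features, action, w_scale, w_shift):
    """Modulate backbone features with an action embedding.

    features: (T, C) hidden states from the pretrained (frozen) model.
    action:   (A,) motor command, e.g. 3-DoF velocity or 25-DoF joint angles.
    w_scale, w_shift: (A, C) trainable projections -- the only new
    parameters, which keeps the adaptation lightweight.
    """
    scale = action @ w_scale   # per-channel gain, shape (C,)
    shift = action @ w_shift   # per-channel bias, shape (C,)
    return features * (1.0 + scale) + shift

rng = np.random.default_rng(0)
T, C, A = 8, 16, 3                       # frames, channels, action dims
feats = rng.normal(size=(T, C))
act = np.array([0.5, 0.0, -0.2])         # e.g. forward, strafe, turn
w_s = rng.normal(size=(A, C)) * 0.01     # small init: start near identity
w_b = rng.normal(size=(A, C)) * 0.01
out = film_condition(feats, act, w_s, w_b)
print(out.shape)  # (8, 16)
```

Initializing the projections near zero means a zero action (or untrained layer) leaves the pretrained features almost untouched, which is one way such adapters preserve the backbone's priors early in fine-tuning.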