Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently over time. As a result, the generated videos look plausible yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision, distilled from a pretrained geometry foundation model, into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on video prediction and geometric consistency in both simulated and real-world scenarios, and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.