Video world models have shown immense promise for interactive simulation and entertainment, but current systems still struggle with two important aspects of interactivity: user control over the environment, which is needed for reproducible and editable experiences, and shared inference, in which multiple players influence a common world. To address these limitations, we introduce an explicit external memory: a persistent state that operates independently of the model's context window, is continually updated by user actions, and is queried throughout the generation rollout. Unlike conventional diffusion game engines, which operate as next-frame predictors, our approach decomposes generation into Memory, Observation, and Dynamics modules. This design gives users direct control over environment structure through an editable memory representation, and it extends naturally to real-time multiplayer rollouts with coherent viewpoints and consistent cross-player interactions.
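The Memory, Observation, and Dynamics decomposition can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tile-grid memory, the class and function names (`Memory`, `dynamics`, `observe`, `rollout`), and the hand-written rules all stand in for components whose actual form the abstract does not specify. The sketch only shows the data flow: user actions update a persistent state held outside any model context, and each player's view is rendered by querying that shared state.

```python
from dataclasses import dataclass, field

# Hypothetical action vocabulary; not taken from the paper.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1), "stay": (0, 0)}


@dataclass
class Memory:
    """Toy external memory: a persistent tile grid plus player positions."""
    width: int
    height: int
    tiles: dict = field(default_factory=dict)    # (x, y) -> tile label
    players: dict = field(default_factory=dict)  # player id -> (x, y)

    def edit(self, pos, label):
        """Direct user edit of the environment structure."""
        self.tiles[pos] = label

    def query(self, center, radius):
        """Return the square patch around `center`; unset tiles read as '.'."""
        cx, cy = center
        return [
            [self.tiles.get((x, y), ".") for x in range(cx - radius, cx + radius + 1)]
            for y in range(cy - radius, cy + radius + 1)
        ]


def dynamics(memory, actions):
    """Stand-in for a dynamics module: apply each player's action to the shared memory."""
    for pid, action in actions.items():
        x, y = memory.players[pid]
        dx, dy = MOVES[action]
        nx = min(max(x + dx, 0), memory.width - 1)
        ny = min(max(y + dy, 0), memory.height - 1)
        if memory.tiles.get((nx, ny)) != "#":  # '#' tiles act as walls
            memory.players[pid] = (nx, ny)


def observe(memory, pid, radius=1):
    """Stand-in for an observation module: render one player's view from the shared memory."""
    px, py = memory.players[pid]
    patch = memory.query((px, py), radius)
    for other, (ox, oy) in memory.players.items():
        rx, ry = ox - px + radius, oy - py + radius
        if 0 <= rx <= 2 * radius and 0 <= ry <= 2 * radius:
            patch[ry][rx] = "@" if other == pid else "P"
    return ["".join(row) for row in patch]


def rollout(memory, action_steps):
    """Alternate dynamics updates and per-player observations over a sequence of joint actions."""
    frames = []
    for actions in action_steps:
        dynamics(memory, actions)
        frames.append({pid: observe(memory, pid) for pid in memory.players})
    return frames
```

In this toy setting, placing a wall with `memory.edit((2, 1), "#")` changes what every player observes on the next step, because all views are rendered from the same state. Replaying the same action sequence from the same initial memory reproduces the same frames.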