Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
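To make the two roles of camera pose concrete, the following is a minimal sketch, not the paper's implementation: it assumes a continuous user action is expressed as an se(3) twist, maps it to a relative SE(3) motion via the exponential map, accumulates it into a global pose, and then uses stored global poses as spatial indices to retrieve the nearest past observations. Function names such as `step_pose` and `retrieve_memory`, the distance weighting, and the time-step convention are illustrative assumptions.

```python
# Hypothetical sketch of action-to-pose accumulation and pose-indexed memory retrieval.
# Not the authors' code; only standard NumPy/SciPy calls are used.

import numpy as np
from scipy.spatial.transform import Rotation


def skew(w: np.ndarray) -> np.ndarray:
    """3x3 skew-symmetric matrix of a rotation vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def se3_exp(twist: np.ndarray) -> np.ndarray:
    """Exponential map from a 6-D twist (v, w) in se(3) to a 4x4 pose in SE(3)."""
    v, w = twist[:3], twist[3:]
    theta = np.linalg.norm(w)
    R = Rotation.from_rotvec(w).as_matrix()
    if theta < 1e-8:                      # small-angle limit: V -> I
        V = np.eye(3)
    else:
        W = skew(w)
        V = (np.eye(3)
             + (1.0 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * W @ W)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T


def step_pose(global_pose: np.ndarray, action_twist: np.ndarray, dt: float) -> np.ndarray:
    """Accumulate a continuous action (a twist, per unit time) into the global camera pose."""
    return global_pose @ se3_exp(action_twist * dt)


def retrieve_memory(current_pose: np.ndarray, past_poses: list[np.ndarray],
                    k: int = 4, rot_weight: float = 1.0) -> list[int]:
    """Indices of the k past observations whose poses are closest to the current pose,
    mixing translation distance with rotation geodesic distance (weighting is assumed)."""
    scores = []
    for T in past_poses:
        d_t = np.linalg.norm(T[:3, 3] - current_pose[:3, 3])
        R_rel = current_pose[:3, :3].T @ T[:3, :3]
        d_r = np.linalg.norm(Rotation.from_matrix(R_rel).as_rotvec())
        scores.append(d_t + rot_weight * d_r)
    return list(np.argsort(scores)[:k])


if __name__ == "__main__":
    pose = np.eye(4)                                    # start at the origin
    forward = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # "move forward" action
    turn = np.array([0.0, 0.0, 0.0, 0.0, 0.3, 0.0])     # "turn right" action
    history = []
    for action in [forward, forward, turn, forward]:
        history.append(pose.copy())
        pose = step_pose(pose, action, dt=0.1)
    print(retrieve_memory(pose, history, k=2))          # nearest past views to revisit
```

In the actual model, the derived poses would condition the diffusion transformer through a camera embedder and the retrieved frames would serve as memory context; the sketch only illustrates the geometric bookkeeping that makes both possible.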