DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu

From arXiv. Project page: https://amap-ml.github.io/DreamX_World, Code: https://github.com/AMAP-ML/DreamX-World

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
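
To make the idea of camera-geometry-based retrieval more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state how stored views are scored, so everything here is an assumption made for illustration: the `MemoryView` record, the `retrieve_views` function, the scoring rule (camera distance plus a weighted viewing-angle difference), and the 60-degree cutoff are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryView:
    """Hypothetical record for one previously generated view."""
    frame_index: int
    position: tuple  # camera center in world coordinates (x, y, z)
    forward: tuple   # unit viewing direction in world coordinates


def _angle_between(u, v):
    """Angle in radians between two unit vectors, clamped for numerical safety."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def retrieve_views(memory, query_position, query_forward, k=2,
                   max_angle=math.radians(60), angle_weight=1.0):
    """Return up to k stored views whose camera pose is closest to the query.

    Assumed scoring rule (not from the paper): Euclidean distance between
    camera centers plus a weighted angle between viewing directions.
    Views looking more than max_angle away from the query are skipped.
    """
    scored = []
    for view in memory:
        angle = _angle_between(view.forward, query_forward)
        if angle > max_angle:
            continue  # camera faces a different part of the scene
        distance = math.dist(view.position, query_position)
        scored.append((distance + angle_weight * angle, view))
    scored.sort(key=lambda item: item[0])  # lowest score = best match
    return [view for _, view in scored[:k]]


# Usage: the camera returns near the origin, looking along +z.
memory = [
    MemoryView(0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    MemoryView(1, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    MemoryView(2, (0.1, 0.0, 0.0), (0.0, 0.0, -1.0)),  # nearby but facing away
    MemoryView(3, (5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
]
retrieved = retrieve_views(memory, (0.2, 0.0, 0.0), (0.0, 0.0, 1.0), k=2)
print([view.frame_index for view in retrieved])
```

In this toy setup, the nearby view that faces the opposite direction is filtered out by the angle cutoff, so proximity alone does not decide what is retrieved. A real system would presumably reason about view overlap with full intrinsics and extrinsics rather than this simple pose distance.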
