Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. The first is a data bottleneck: the lack of human-object interaction data with accurate, dense labels. The second is a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions generated via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. The project page is available at https://interactwm.github.io/ActWorld.
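To make the memory bottleneck concrete, the following is a minimal sketch of importance-routed history compression, written under stated assumptions rather than taken from ActWorld itself. The `Chunk` type, the scalar `importance` score, the `threshold`, and the stride-based subsampling are all hypothetical stand-ins; the abstract does not specify how importance is computed or how compression is implemented. The sketch only illustrates the routing idea: chunks flagged as event transitions keep their full token budget regardless of age, whereas a purely recency-based policy would subsample them along with every other old chunk.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int         # position of the chunk in the rollout
    tokens: list       # placeholder for the chunk's latent tokens
    importance: float  # hypothetical interaction-importance score in [0, 1]


def compress(tokens, stride):
    """Keep every `stride`-th token (a stand-in for a learned compressor)."""
    return tokens[::stride]


def build_context(history, recent=2, threshold=0.5, old_stride=4):
    """Route each past chunk to a compression level.

    The last `recent` chunks and any chunk whose importance reaches
    `threshold` keep all tokens; older low-importance chunks are subsampled.
    All parameter names and defaults are illustrative assumptions.
    """
    context = []
    cutoff = len(history) - recent
    for pos, chunk in enumerate(history):
        is_recent = pos >= cutoff
        is_event = chunk.importance >= threshold
        stride = 1 if (is_recent or is_event) else old_stride
        context.append((chunk.index, compress(chunk.tokens, stride)))
    return context


# Five chunks of eight tokens each; chunk 1 contains an interaction event.
history = [
    Chunk(index=i, tokens=list(range(8)), importance=0.9 if i == 1 else 0.1)
    for i in range(5)
]

action_aware = [len(tokens) for _, tokens in build_context(history)]
# Setting the threshold above 1 disables event routing (recency only).
recency_only = [len(tokens) for _, tokens in build_context(history, threshold=2.0)]
```

In this toy setup the event chunk retains all eight tokens under action-aware routing but is reduced to two under the recency-only policy, which is the kind of loss the abstract describes as action forgetting. The persistent memory bank is not modeled here.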