Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing only partial information about the underlying state. Recent video world models attempt to learn such action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse, semantically meaningful action spaces, and their actions are tied directly to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics or to maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world-modeling dataset with explicit state annotations, collected automatically from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench, a benchmark that evaluates models on Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.