Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing only partial information about the underlying state. Recent video world models attempt to learn such action-conditioned dynamics from data. However, existing datasets rarely meet this requirement: they typically lack diverse, semantically meaningful action spaces, and their actions are tied directly to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics or to maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world-modeling dataset with explicit state annotations, collected automatically from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench, a benchmark that evaluates models on Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.