GameGen-X: Interactive Open-world Game Video Generation

We introduce GameGen-X, the first diffusion transformer model specifically designed for both generating and interactively controlling open-world game videos. The model supports high-quality, open-domain generation by simulating a wide range of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. It also provides interactive controllability: it predicts and alters future content based on the current clip, which allows for gameplay simulation. To realize this vision, we first collected and built an Open-World Video Game Dataset from scratch. It is the first and largest dataset for open-world game video generation and control, comprising over a million diverse gameplay video clips sampled from more than 150 games, with informative captions generated by GPT-4o. GameGen-X is trained in two stages: foundation model pre-training and instruction tuning. First, the model is pre-trained on text-to-video generation and video continuation, which gives it the capability for long-sequence, high-quality, open-domain game video generation. Then, to achieve interactive controllability, we design InstructNet to incorporate game-related multi-modal control signal experts. This allows the model to adjust latent representations based on user inputs, unifying character interaction and scene content control for the first time in video generation. During instruction tuning, only InstructNet is updated while the pre-trained foundation model stays frozen, so interactive controllability is integrated without loss of diversity or quality in the generated video content.
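The second training stage, in which the foundation model is frozen and only the control module is updated, can be illustrated with a toy numerical sketch. This is not the GameGen-X architecture: the scalar "foundation model" `base_model`, the additive adapter `instruct_net`, the weight names `w` and `a`, and the synthetic data are all hypothetical stand-ins chosen so that the example runs with the Python standard library alone. It only shows the general pattern of adjusting a latent from a control signal while leaving the pre-trained weights untouched.

```python
import random


def base_model(x, w):
    """Frozen stand-in for the pre-trained foundation model (hypothetical)."""
    return w * x


def instruct_net(latent, control, a):
    """Trainable stand-in adapter: shifts the latent by a * control (hypothetical)."""
    return latent + a * control


def train_adapter(data, w, a=0.0, lr=0.05, epochs=200):
    """Fit only the adapter weight `a` by SGD on squared error; `w` is never updated."""
    for _ in range(epochs):
        for x, control, target in data:
            pred = instruct_net(base_model(x, w), control, a)
            grad_a = 2.0 * (pred - target) * control  # d/da of (pred - target)^2
            a -= lr * grad_a
    return a


# Synthetic data: targets follow the frozen model plus a control-dependent shift.
w_frozen = 1.5
true_a = 0.8
rng = random.Random(0)
data = []
for _ in range(32):
    x = rng.uniform(-1.0, 1.0)
    control = rng.uniform(-1.0, 1.0)
    data.append((x, control, w_frozen * x + true_a * control))

a_learned = train_adapter(data, w_frozen)
```

In this sketch a zero control signal leaves the frozen model's output unchanged, which loosely mirrors the abstract's claim that controllability is added without degrading what the pre-trained model already generates.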
