World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advances in video generative models to develop world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, enabling agents to interact with the model via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation adaptable to serve as an interactive world model for a wide range of downstream tasks, including action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves performance competitive with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
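To make the interleaved token format concrete, the following is a minimal sketch of how a trajectory of observations, actions, and rewards could be flattened into one sequence for next-token prediction. The vocabulary sizes, offset scheme, and helper name here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical token layout: observation codes, action ids, and reward bins
# share one vocabulary via fixed offsets. All sizes below are assumptions.
OBS_VOCAB = 1000          # discrete codes from a visual tokenizer (assumed)
NUM_ACTIONS = 8           # discrete action ids (assumed)
NUM_REWARD_BINS = 3       # discretized reward buckets (assumed)

ACT_OFFSET = OBS_VOCAB                  # action tokens follow observation codes
REW_OFFSET = ACT_OFFSET + NUM_ACTIONS   # reward tokens follow action tokens

def flatten_trajectory(steps):
    """Interleave per-step observation codes, an action token, and a reward
    token into one flat sequence that an autoregressive transformer could
    model with next-token prediction."""
    tokens = []
    for obs_codes, action, reward_bin in steps:
        tokens.extend(obs_codes)                # compressed visual tokens
        tokens.append(ACT_OFFSET + action)      # action token
        tokens.append(REW_OFFSET + reward_bin)  # reward token
    return tokens

# Example: two steps, each observation compressed to 4 discrete codes.
traj = [([3, 17, 256, 999], 2, 1), ([5, 17, 300, 998], 0, 2)]
seq = flatten_trajectory(traj)
assert len(seq) == 2 * (4 + 1 + 1)
assert seq[4] == ACT_OFFSET + 2 and seq[5] == REW_OFFSET + 1
```

During interaction, an agent would append its action token to the sequence and let the model generate the next observation and reward tokens, which is what makes the next-token interface interactive.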