World models are increasingly employed across diverse fields, from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions and are confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we integrate it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Evaluations on WorldNet directly demonstrate WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential as a world simulator, helping multimodal agents generalize to unfamiliar domains by efficiently synthesizing multimodal instruction instances, which prove to be as reliable as authentic data for fine-tuning purposes. The project is available at \url{https://github.com/DCDmllm/WorldGPT}.