World models are increasingly employed across diverse fields, from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions and are confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we integrate it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Evaluations on WorldNet directly demonstrate WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential as a world simulator, helping multimodal agents generalize to unfamiliar domains by efficiently synthesizing multimodal instruction instances, which prove to be as reliable as authentic data for fine-tuning. The project is available at \url{https://github.com/DCDmllm/WorldGPT}.