Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, current research predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of the Reinforcement Learning (RL) based controllers used in existing methods. To tackle these challenges, we introduce Ghost in the Minecraft (GITM), a novel framework that integrates Large Language Models (LLMs) with text-based knowledge and memory, aiming to create Generally Capable Agents (GCAs) in Minecraft. These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments through text-based interactions. We develop a set of structured actions and leverage LLMs to generate action plans for the agents to execute. The resulting LLM-based agent markedly surpasses previous methods, achieving an improvement of +47.5% in success rate on the "ObtainDiamond" task and demonstrating superior robustness compared to traditional RL-based controllers. Notably, our agent is the first to procure all items in the Minecraft Overworld technology tree, demonstrating its extensive capabilities. GITM requires no GPUs for training; a single CPU node with 32 CPU cores is sufficient. This research shows the potential of LLMs in developing capable agents for handling long-horizon, complex tasks and adapting to uncertainties in open-world environments. See the project website at https://github.com/OpenGVLab/GITM.
