Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and the transitions between them, RL agents could use this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, which is then verified through world experience, to improve the sample efficiency of RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase, where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM. Our method of hypothesizing an AWM with LLMs and then verifying it against agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to errors in the LLM's output and corrects them, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
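The Dream/Wake loop described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the DECKARD implementation: the names `AWM`, `ToyCraftEnv`, `stub_llm`, `dream`, `wake`, and `run_agent` are hypothetical, the recipes are a toy stand-in rather than Minecraft's real crafting rules, the LLM is replaced by a hard-coded noisy lookup, and learning a modular policy per subgoal is collapsed into a single environment call that either succeeds or fails.

```python
from dataclasses import dataclass, field

# Toy ground-truth rules (hypothetical, NOT Minecraft's real recipes):
# item -> set of items that must already be in the inventory.
TRUE_RECIPES = {
    "log": set(),
    "plank": {"log"},
    "stick": {"plank"},
    "wooden_pickaxe": {"plank", "stick"},
}


def stub_llm(item):
    """Stand-in for a few-shot LLM query; deliberately noisy."""
    guesses = {
        "wooden_pickaxe": {"stick"},  # incomplete: omits plank
        "stick": {"log"},             # wrong: the toy rule needs plank
        "log": set(),
    }
    return set(guesses.get(item, set()))


@dataclass
class AWM:
    """Abstract World Model: prerequisite edges between subgoals."""
    hypothesized: dict = field(default_factory=dict)  # proposed by the LLM
    verified: dict = field(default_factory=dict)      # confirmed by experience

    def prereqs(self, item):
        # Verified knowledge always overrides a hypothesis.
        if item in self.verified:
            return self.verified[item]
        return self.hypothesized.get(item, set())


class ToyCraftEnv:
    def __init__(self, recipes):
        self.recipes = recipes
        self.inventory = set()
        self.steps = 0

    def try_obtain(self, item):
        """Return the prerequisites actually used on success, else None."""
        self.steps += 1
        needed = self.recipes[item]
        if needed <= self.inventory:
            self.inventory.add(item)
            return set(needed)
        return None


def dream(awm, goal, llm):
    """Dream phase: decompose the goal into an ordered list of subgoals."""
    order, seen = [], set()

    def visit(item):
        if item in seen:
            return
        seen.add(item)
        if item not in awm.verified and item not in awm.hypothesized:
            awm.hypothesized[item] = llm(item)
        for pre in sorted(awm.prereqs(item)):
            visit(pre)
        order.append(item)  # prerequisites come before the item itself

    visit(goal)
    return order


def wake(awm, plan, env):
    """Wake phase: attempt subgoals in order and verify or correct the AWM."""
    for item in plan:
        if item in env.inventory:
            continue
        used = env.try_obtain(item)
        if used is None:
            # The hypothesis was insufficient: explore without guidance
            # until one new item is obtained, then replan.
            for other in sorted(env.recipes):
                if other not in env.inventory:
                    found = env.try_obtain(other)
                    if found is not None:
                        awm.verified[other] = found
                        break
            return False
        awm.verified[item] = used
    return True


def run_agent(goal, llm, env, max_rounds=10):
    awm = AWM()
    for _ in range(max_rounds):
        if goal in env.inventory:
            break
        wake(awm, dream(awm, goal, llm), env)
    return awm


env = ToyCraftEnv(TRUE_RECIPES)
awm = run_agent("wooden_pickaxe", stub_llm, env)
print("verified AWM:", awm.verified)
print("environment steps:", env.steps)
```

In this toy run the stub's guesses for `stick` and `wooden_pickaxe` are wrong or incomplete, so the first plan fails partway through. The agent then falls back to exploration, records what actually worked as verified edges, and replans with the verified edges taking precedence over the remaining hypotheses. This is only meant to convey the hypothesize-then-verify structure; it says nothing about the sample-efficiency results reported above.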