Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes it difficult to learn complex tasks with sparse rewards. If initialized with knowledge of high-level subgoals and the transitions between them, RL agents could use this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, which is then tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, in which the agent uses an LLM to decompose a task into a sequence of subgoals (the hypothesized AWM); and (2) the Wake phase, in which the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM based on its experience. This approach of hypothesizing an AWM with LLMs and then verifying it against agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to errors in the LLM's output and corrects them, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
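The Dream and Wake phases can be illustrated with a toy sketch. This is not the DECKARD implementation: the recipe table, the `mock_llm` stand-in for a few-shot LLM call, and the functions `dream`, `try_craft`, and `wake` are all hypothetical names introduced here, and the learned modular policies are replaced by a single success-or-failure crafting attempt. The sketch only shows how a hypothesized AWM (a map from each item to its guessed prerequisites) can guide exploration and then be overwritten by what the environment actually reports.

```python
from collections import deque

# Hypothetical ground-truth dependencies, standing in for the environment.
TRUE_RECIPES = {
    "log": [],
    "plank": ["log"],
    "stick": ["plank"],
    "crafting_table": ["plank"],
    "wooden_pickaxe": ["plank", "stick", "crafting_table"],
}

# Hypothetical noisy LLM answers: two entries are deliberately wrong.
LLM_GUESSES = {
    "log": [],
    "plank": ["log"],
    "stick": ["log"],                               # wrong prerequisite
    "crafting_table": ["plank"],
    "wooden_pickaxe": ["stick", "crafting_table"],  # missing "plank"
}


def mock_llm(item):
    """Stand-in for a few-shot LLM query (assumption, not a real API)."""
    return list(LLM_GUESSES[item])


def dream(goal):
    """Dream phase: expand the goal into a hypothesized AWM."""
    awm = {}
    frontier = deque([goal])
    while frontier:
        item = frontier.popleft()
        if item not in awm:
            awm[item] = mock_llm(item)
            frontier.extend(awm[item])
    return awm


def try_craft(item, inventory):
    """Toy environment step: return the required items on success, else None."""
    needed = TRUE_RECIPES[item]
    return list(needed) if all(p in inventory for p in needed) else None


def wake(goal, hypothesized):
    """Wake phase: attempt subgoals, preferring those the AWM says are ready,
    and record the dependencies the environment actually reveals."""
    verified, inventory, failures = {}, set(), 0
    while goal not in inventory:
        todo = [i for i in hypothesized if i not in inventory]
        guided = [i for i in todo
                  if all(p in inventory for p in hypothesized[i])]
        others = [i for i in todo if i not in guided]
        for item in guided + others:
            observed = try_craft(item, inventory)
            if observed is None:
                failures += 1  # the hypothesized edge was wrong or premature
                continue
            inventory.add(item)
            verified[item] = observed  # replace the guess with experience
            break
        else:
            raise RuntimeError("no reachable subgoal; the AWM lacks a needed item")
    return verified, failures
```

Calling `wake("wooden_pickaxe", dream("wooden_pickaxe"))` returns the verified dependencies together with the number of failed attempts. In this toy setup the wrong guess for `stick` costs one failed attempt before the agent observes the true prerequisite, which mirrors the idea that LLM errors slow exploration but do not prevent the AWM from being corrected.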