Solving long-horizon, temporally-extended tasks using Reinforcement Learning (RL) is challenging, and the difficulty is compounded by the common practice of learning without prior knowledge (tabula rasa learning). Humans can generate and execute plans with temporally-extended actions and quickly learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have this same ability. Recently, Large Language Models (LLMs) have been shown to encode a tremendous amount of knowledge about the world and to perform impressive in-context learning and reasoning. However, using LLMs to solve real-world problems is hard because they are not grounded in the current task. In this paper we exploit the planning capabilities of LLMs while using RL to provide learning from the environment, resulting in a hierarchical agent that uses LLMs to solve long-horizon tasks. Rather than relying entirely on LLMs, the agent uses them to guide a high-level policy, which makes learning significantly more sample-efficient. We evaluate this approach in simulation environments such as MiniGrid, SkillHack, and Crafter, and on a real robot arm in block manipulation tasks. We show that agents trained using our approach outperform the baseline methods and, once trained, do not need access to LLMs during deployment.
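To make the idea of an LLM guiding a high-level policy more concrete, the following is a minimal sketch under stated assumptions, not the method as specified in the paper. It assumes that the high-level policy outputs logits over a fixed set of skills, that an LLM has been queried for a probability-like score for each skill, and that the two are combined additively in log space with a guidance weight that is annealed to zero during training. The skill names, the prior values, the combination rule, and the linear annealing schedule are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_skill_distribution(policy_logits, llm_prior, guidance_weight):
    """Blend high-level policy logits with an LLM-derived prior over skills.

    Assumption: the blend is additive in log space. With guidance_weight = 0
    the result is the learned policy alone, so no LLM call is required.
    """
    eps = 1e-8  # avoids log(0) for skills the LLM scores as impossible
    blended = [
        logit + guidance_weight * math.log(p + eps)
        for logit, p in zip(policy_logits, llm_prior)
    ]
    return softmax(blended)


def annealed_weight(step, total_steps, initial=1.0):
    """Hypothetical schedule: linear decay of the guidance weight to 0."""
    return initial * max(0.0, 1.0 - step / total_steps)


# Hypothetical skills for a key-and-door task in a grid world.
skills = ["go_to_key", "pick_up_key", "open_door"]
policy_logits = [0.0, 0.0, 0.0]  # untrained policy: uniform over skills
llm_prior = [0.7, 0.2, 0.1]      # made-up LLM scores, not real model output

# Early in training the LLM prior dominates skill selection.
early = guided_skill_distribution(
    policy_logits, llm_prior, annealed_weight(0, 1000)
)
# At the end of training the weight is 0 and only the policy matters.
late = guided_skill_distribution(
    policy_logits, llm_prior, annealed_weight(1000, 1000)
)
```

In this sketch the untrained policy initially follows the LLM's suggestion, which is one way guidance could reduce the amount of exploration needed. Because the weight reaches zero, the deployed agent selects skills from its own logits, consistent with the claim that no LLM access is needed after training.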