Reasoning in complex and ambiguous environments is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can solve difficult tasks, they require large amounts of training data and often struggle to generalize to unseen environments and new tasks. Large Scale Language Models (LSLMs), on the other hand, have exhibited strong reasoning ability and can adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that issues commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot, investigate failure cases, and demonstrate how components of this system can be trained with reinforcement learning to improve performance.
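The interaction between the three components can be illustrated with a minimal sketch. This is not the implementation used in this work: the one-dimensional `ToyEnv`, the function names, the command strings, and the rule-based `stub_planner` (standing in for a pre-trained language model) are all hypothetical and chosen only to show the control flow, in which the Planner reads the dialogue so far, the Actor executes the Planner's command, and the Reporter describes the result back to the Planner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyEnv:
    """Hypothetical 1-D world (an assumption for illustration only)."""
    position: int = 0
    goal: int = 3


def actor(env: ToyEnv, command: str) -> None:
    """Actor: turns a natural-language command into a primitive action."""
    if command == "move right":
        env.position += 1
    elif command == "move left":
        env.position -= 1


def reporter(env: ToyEnv) -> str:
    """Reporter: describes the environment state in text for the Planner."""
    if env.position == env.goal:
        return "at goal"
    if env.position < env.goal:
        return "goal is to the right"
    return "goal is to the left"


def stub_planner(dialogue: List[str]) -> str:
    """Rule-based stand-in for a pre-trained language model (assumption).

    A real Planner would be prompted with the whole dialogue; this stub
    only inspects the most recent report.
    """
    last_report = dialogue[-1]
    if "right" in last_report:
        return "move right"
    if "left" in last_report:
        return "move left"
    return "stop"


def run_episode(
    env: ToyEnv,
    planner: Callable[[List[str]], str],
    max_steps: int = 10,
) -> List[str]:
    """Alternate Planner commands and Reporter feedback until 'stop'."""
    dialogue = [f"Reporter: {reporter(env)}"]
    for _ in range(max_steps):
        command = planner(dialogue)
        dialogue.append(f"Planner: {command}")
        if command == "stop":
            break
        actor(env, command)
        dialogue.append(f"Reporter: {reporter(env)}")
    return dialogue


if __name__ == "__main__":
    for line in run_episode(ToyEnv(), stub_planner):
        print(line)
```

In this sketch the Planner never observes the environment directly; everything it knows arrives through the Reporter's text, which is why the Reporter is a natural candidate for training with reinforcement learning.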