Reinforcement learning algorithms typically struggle in the absence of a dense, well-shaped reward function. Intrinsically motivated exploration methods address this limitation by rewarding agents for visiting novel states or transitions, but these methods offer limited benefits in large environments where most discovered novelty is irrelevant for downstream tasks. We describe a method that uses background knowledge from text corpora to shape exploration. This method, called ELLM (Exploring with LLMs), rewards an agent for achieving goals suggested by a language model prompted with a description of the agent's current state. By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop. We evaluate ELLM in the Crafter game environment and the Housekeep robotic simulator, showing that ELLM-trained agents have better coverage of common-sense behaviors during pretraining and usually match or improve performance on a range of downstream tasks.
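The reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the keyword-based stand-in for the language model, and the exact-string test for whether a goal was achieved are all assumptions made for illustration.

```python
from typing import Callable, List


def suggest_goals_stub(state_description: str) -> List[str]:
    """Hypothetical stand-in for a language model call.

    A real system would prompt a language model with the state
    description; this stub returns hand-written goals based on keywords.
    """
    goals = []
    if "tree" in state_description:
        goals.append("collect wood")
    if "water" in state_description:
        goals.append("drink water")
    return goals


def intrinsic_reward(
    state_description: str,
    achieved: str,
    suggest_goals: Callable[[str], List[str]],
) -> float:
    """Return 1.0 if the agent's achievement matches a suggested goal.

    Exact string matching is an assumption of this sketch; the abstract
    does not say how goal achievement is detected.
    """
    goals = suggest_goals(state_description)
    return 1.0 if achieved in goals else 0.0


if __name__ == "__main__":
    # The agent is near a tree and has just collected wood.
    print(intrinsic_reward("You see a tree.", "collect wood", suggest_goals_stub))
```

In this sketch the agent is rewarded only when what it did matches a goal proposed for its current state, so novelty that no suggested goal covers earns nothing.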