Reward design in reinforcement learning (RL) is challenging because specifying human notions of desired behavior can be difficult to do with reward functions or can require many expert demonstrations. Can we instead design rewards cheaply using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 to act as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach uses this proxy reward function within an RL framework. Specifically, the user specifies a prompt once at the beginning of training. During training, the LLM evaluates the RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
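The training loop described above (a prompt fixed once, the LLM scoring each episode, the agent updating on that score) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `LLM` callable interface, the yes/no question format, the mapping of the answer to a binary reward, the example prompt, and the toy tabular update. `stub_llm` is a stand-in that replaces a real model call so the sketch runs offline.

```python
from typing import Callable, Dict

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def build_query(user_prompt: str, episode_summary: str) -> str:
    """Append a textual episode summary and a yes/no question to the fixed user prompt."""
    return (
        f"{user_prompt}\n\n"
        f"Episode: {episode_summary}\n"
        "Does this episode match the desired behavior? Answer Yes or No.\n"
        "Answer:"
    )


def parse_reward(completion: str) -> float:
    """Assumed mapping: a completion starting with 'yes' gives 1.0, anything else 0.0."""
    return 1.0 if completion.strip().lower().startswith("yes") else 0.0


def proxy_reward(llm: LLM, user_prompt: str, episode_summary: str) -> float:
    """Query the LLM about one episode and convert its answer into a scalar reward."""
    return parse_reward(llm(build_query(user_prompt, episode_summary)))


def stub_llm(query: str) -> str:
    """Offline stand-in for a real LLM call; it inspects only the 'Episode:' line."""
    episode_line = next(
        line for line in query.splitlines() if line.startswith("Episode:")
    )
    return "Yes" if "rejected" in episode_line else "No"


def train(
    llm: LLM,
    user_prompt: str,
    episodes: Dict[str, str],
    steps: int = 50,
    lr: float = 0.5,
) -> Dict[str, float]:
    """Toy tabular update: move each action's value estimate toward its proxy reward."""
    values = {action: 0.0 for action in episodes}
    for _ in range(steps):
        for action, summary in episodes.items():
            reward = proxy_reward(llm, user_prompt, summary)
            values[action] += lr * (reward - values[action])
    return values


# Hypothetical zero-shot prompt for an Ultimatum Game responder, written once.
USER_PROMPT = (
    "The responder should reject offers that give them less than 30% of the total."
)
# Hypothetical textual summaries of the episode produced by each responder action.
EPISODES = {
    "accept": "The proposer offered 1 of 10 dollars and the responder accepted.",
    "reject": "The proposer offered 1 of 10 dollars and the responder rejected.",
}

values = train(stub_llm, USER_PROMPT, EPISODES)
best_action = max(values, key=values.get)
```

In this sketch the user-facing effort is confined to `USER_PROMPT`; swapping `stub_llm` for a function that calls an actual model would leave the rest of the loop unchanged. A real agent would also replace the tabular update with whatever RL algorithm is being trained.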