Designing effective reward functions is crucial for training reinforcement learning (RL) agents. However, reward design is non-trivial, even for domain experts, because many tasks have a subjective quality that is hard to quantify explicitly. In recent works, large language models (LLMs) have been used to generate rewards from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect implicit human knowledge. We study this in three challenging settings -- autonomous driving, humanoid locomotion, and dexterous manipulation -- wherein notions of ``good'' behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions, using human feedback to guide the evolutionary process and thereby translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.
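To make the evolutionary loop described above concrete, below is a minimal Python sketch of one plausible realization. All names here (propose_reward, mutate_with_llm, train_agent, human_fitness) are hypothetical placeholders rather than REvolve's actual API: an LLM proposes reward functions, RL agents are trained under each candidate, human feedback on the resulting behavior serves as the fitness signal, and the best candidates are mutated by the LLM to seed the next generation.

```python
# Hypothetical sketch of an LLM-driven evolutionary reward-design loop,
# in the spirit of the framework described above. Not the paper's code.
import random
from typing import Callable, List, Tuple

RewardFn = Callable[[dict], float]  # maps an environment state to a scalar reward


def evolve_rewards(
    propose_reward: Callable[[], RewardFn],                # LLM drafts a candidate from the task description
    mutate_with_llm: Callable[[RewardFn, str], RewardFn],  # LLM refines a candidate given textual feedback
    train_agent: Callable[[RewardFn], object],             # trains an RL agent under a candidate reward
    human_fitness: Callable[[object], Tuple[float, str]],  # human score + feedback on agent rollouts
    pop_size: int = 4,
    generations: int = 5,
) -> RewardFn:
    # Initial population: independent LLM proposals.
    population: List[RewardFn] = [propose_reward() for _ in range(pop_size)]
    best_fn, best_score = population[0], float("-inf")
    for _ in range(generations):
        # Evaluate: train an agent per candidate; humans score its behavior.
        scored = []
        for fn in population:
            agent = train_agent(fn)
            score, feedback = human_fitness(agent)
            scored.append((score, feedback, fn))
            if score > best_score:
                best_fn, best_score = fn, score
        # Selection: keep the top half as parents.
        scored.sort(key=lambda t: t[0], reverse=True)
        parents = scored[: max(1, pop_size // 2)]
        # Variation: LLM-guided mutation conditioned on human feedback.
        population = [
            mutate_with_llm(fn, feedback)
            for _, feedback, fn in random.choices(parents, k=pop_size)
        ]
    return best_fn
```

The key design choice this sketch illustrates is that human feedback plays two roles: a scalar score acts as the fitness function for selection, while free-form textual feedback conditions the LLM's mutation step, so each generation's reward functions are refined toward behavior humans judge as good.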