Reward design is a fundamental yet challenging aspect of practical reinforcement learning (RL). For simple tasks, researchers typically handcraft the reward function, e.g., as a linear combination of several reward factors. However, such reward engineering is subject to approximation bias, incurs a high tuning cost, and often cannot provide the granularity required for complex tasks. To avoid these difficulties, researchers have turned to reinforcement learning from human feedback (RLHF), which learns a reward function from human preferences between pairs of trajectory sequences. By leveraging preference-based reward modeling, RLHF learns complex rewards that are well aligned with human preferences, allowing RL to tackle increasingly difficult problems. Unfortunately, the applicability of RLHF is limited by the high cost and difficulty of obtaining human preference data. In light of this cost, we investigate learning reward functions for complex tasks with less human effort, requiring only a ranking of the reward factors by importance. More specifically, we propose a new RL framework, HERON, which compares trajectories using a hierarchical decision tree induced by the given ranking. These comparisons are used to train a preference-based reward model, which is then used for policy learning. We find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness. Our code is available at https://github.com/abukharin3/HERON.
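The comparison step described in the abstract can be illustrated with a minimal sketch. It is not taken from the HERON repository: the function names, the per-factor `margins` used to decide when two trajectories count as tied on a factor, and the Bradley-Terry loss (a common choice for preference-based reward models) are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def hierarchical_compare(
    factors_a: Sequence[float],
    factors_b: Sequence[float],
    margins: Sequence[float],
) -> Optional[int]:
    """Compare two trajectories factor by factor, most important factor first.

    Returns 0 if trajectory A is preferred, 1 if B is preferred, and None if
    the two are within the margin on every factor. The margins are a
    hypothetical tie threshold, not something the abstract specifies.
    """
    for a, b, margin in zip(factors_a, factors_b, margins):
        if a - b > margin:
            return 0  # A wins at this level of the hierarchy
        if b - a > margin:
            return 1  # B wins at this level of the hierarchy
        # Otherwise the factor is a tie; fall through to the next one.
    return None


def bradley_terry_loss(reward_a: float, reward_b: float, preferred: int) -> float:
    """Negative log-likelihood of a preference label under a Bradley-Terry model.

    Assumes P(A preferred) = sigmoid(reward_a - reward_b).
    """
    diff = reward_a - reward_b if preferred == 0 else reward_b - reward_a
    # -log(sigmoid(diff)) written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical factor values, listed from most to least important.
traj_a = [1.0, 0.2, 5.0]
traj_b = [1.0, 0.9, 9.0]
label = hierarchical_compare(traj_a, traj_b, margins=[0.1, 0.1, 0.1])
```

Here the two trajectories tie on the first factor, so the second factor decides and `label` is 1 (trajectory B preferred). Labels produced this way could then serve as training targets for a reward model by minimizing `bradley_terry_loss` over the model's predicted rewards for each pair.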