Reward design is a fundamental yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scales and intricate dependencies of the feedback signals. This paper shows that by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward modeling framework -- HERON -- for two scenarios: (I) the feedback signals naturally exhibit a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to aid policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications, and we find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness. Our code is available at \url{https://github.com/abukharin3/HERON}.
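To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of the kind of hierarchical comparison the abstract describes: two trajectories are compared signal by signal in order of importance, and a tie on a more important signal falls through to the next one. The signal names, tie margin, and function name are illustrative assumptions.

```python
# Hypothetical sketch of HERON-style hierarchical trajectory comparison:
# feedback signals are checked from most to least important; differences
# within a relative margin count as a tie and defer to the next signal.

def compare_trajectories(feedback_a, feedback_b, signal_order, margin=0.1):
    """Return +1 if trajectory A is preferred, -1 if B, 0 if fully tied.

    feedback_a / feedback_b: dicts mapping signal name -> scalar value
    signal_order: signal names sorted from most to least important
    margin: relative threshold below which two values count as a tie
    """
    for signal in signal_order:
        a, b = feedback_a[signal], feedback_b[signal]
        scale = max(abs(a), abs(b), 1e-8)  # avoid division by zero
        if abs(a - b) / scale > margin:    # decisive difference found
            return 1 if a > b else -1
    return 0  # tied on every signal

# Illustrative signals for a code-generation task (assumed names):
traj_a = {"tests_passed": 0.9, "style_score": 0.5}
traj_b = {"tests_passed": 0.9, "style_score": 0.8}
order = ["tests_passed", "style_score"]  # test results outrank style
print(compare_trajectories(traj_a, traj_b, order))  # prints -1 (B preferred)
```

Preference labels produced this way can then serve as training data for a reward model, as in standard preference-based RL pipelines.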