Designing reward functions that efficiently guide reinforcement learning (RL) agents toward specific behaviors is difficult, because it requires reward structures that are not sparse and that do not inadvertently induce undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. Although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it can significantly impair performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. These functions integrate auxiliary rewards reflecting a designer's heuristics and domain knowledge with the environment's primary rewards. Our approach automatically determines the most effective way to blend these two types of feedback, thereby enhancing robustness against heuristic reward misspecification. It can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality, some beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly specified auxiliary reward functions.
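For readers unfamiliar with the two baselines the abstract contrasts, the following is a minimal sketch, not the proposed method. It shows the standard potential-based shaping term, gamma * Phi(s') - Phi(s), and a naive fixed-weight blend of primary and auxiliary rewards. The names `shaped_reward`, `blended_reward`, `phi`, and the 1-D chain with its goal at state 5 are hypothetical, introduced only for illustration; the framework described above learns how to combine the two signals rather than fixing a weight by hand.

```python
from typing import Callable


def shaped_reward(r_env: float, s: int, s_next: int,
                  potential: Callable[[int], float], gamma: float = 0.99) -> float:
    """Potential-based shaping: add gamma * Phi(s') - Phi(s) to the primary reward."""
    return r_env + gamma * potential(s_next) - potential(s)


def blended_reward(r_env: float, r_aux: float, weight: float) -> float:
    """Naive blend with a hand-picked weight (an assumption for illustration only)."""
    return r_env + weight * r_aux


def phi(s: int) -> float:
    """Hypothetical heuristic potential: negative distance to a goal at state 5."""
    return -abs(5 - s)


# Moving one step toward the goal earns a positive shaping bonus.
step_bonus = shaped_reward(0.0, 2, 3, phi, gamma=1.0)

# With gamma = 1 the shaping terms telescope along a path to Phi(end) - Phi(start),
# so they depend only on the endpoints and not on the route taken.
path = [0, 1, 2, 3, 4, 5]
total_bonus = sum(shaped_reward(0.0, a, b, phi, gamma=1.0)
                  for a, b in zip(path, path[1:]))

# A misaligned auxiliary signal shifts the blended reward in proportion to the weight.
blended = blended_reward(1.0, -2.0, 0.5)
```

The telescoping sum is what lets potential-based shaping leave the optimal policy unchanged, whereas the fixed-weight blend carries no such guarantee if the auxiliary reward is misspecified.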