Designing reward functions that efficiently guide reinforcement learning (RL) agents toward specific behaviors is difficult: the reward structure must be dense enough to provide a useful learning signal, yet must not inadvertently induce undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. Although potential-based reward shaping is frequently suggested as a remedy, we systematically investigate settings where deploying it often significantly impairs performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. These functions integrate auxiliary rewards, which reflect a designer's heuristics and domain knowledge, with the environment's primary rewards. Our approach automatically determines the most effective way to blend these two types of feedback, thereby enhancing robustness against heuristic reward misspecification. Notably, it can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality, some beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly specified auxiliary reward functions.
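For readers unfamiliar with the potential-based reward shaping mentioned above, the following minimal sketch illustrates its standard textbook form, in which the term gamma * Phi(s') - Phi(s) is added to the primary reward. It is not the framework proposed here. The chain environment, the `potential` heuristic, and all function names are hypothetical choices made only for illustration. The sketch checks that, along a fixed trajectory, shaping changes the discounted return only by an offset that depends on the first and last states.

```python
def potential(state, goal=3):
    """Hypothetical heuristic potential: negative distance to an assumed goal state."""
    return -abs(goal - state)


def shaped_reward(reward, state, next_state, gamma):
    """Primary reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state) - potential(state)


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


# Assumed toy trajectory on a 1-D chain: 0 -> 1 -> 2 -> 3, reward only at the goal.
gamma = 0.9
states = [0, 1, 2, 3]
primary = [0.0, 0.0, 1.0]

shaped = [
    shaped_reward(r, s, s_next, gamma)
    for r, s, s_next in zip(primary, states, states[1:])
]

g_primary = discounted_return(primary, gamma)
g_shaped = discounted_return(shaped, gamma)

# The shaping terms telescope, so the two returns differ by
# gamma**T * Phi(s_T) - Phi(s_0), regardless of the intermediate states.
offset = gamma ** len(primary) * potential(states[-1]) - potential(states[0])
print(g_shaped - g_primary, offset)
```

This telescoping property explains why shaping of this form is often proposed as a safe way to densify rewards. The sketch does not reproduce the settings in which the abstract reports impaired performance; it only shows the mechanism being discussed.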