The reward function is essential in reinforcement learning (RL): it serves as the guiding signal that incentivizes agents to solve given tasks, yet it is notoriously difficult to design. In many cases, only imperfect rewards are available, which causes substantial performance loss for RL agents. In this study, we propose a unified offline policy optimization approach, \textit{RGM (Reward Gap Minimization)}, which can handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM outperforms existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct rewards that are wrong or inconsistent with expert preference and retrieve useful information from biased rewards.
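As a schematic illustration only, the bi-level structure described above might be written as follows. The notation is assumed here and does not come from the abstract: $\tilde{r}$ denotes the imperfect reward, $\Delta r$ the reward correction term, $d^{\pi}$ the visitation distribution of policy $\pi$, $d^{E}$ and $d^{D}$ the visitation distributions of the expert data and the offline dataset, $D_f$ an $f$-divergence, and $\alpha > 0$ a weight on the pessimism penalty.

\[
\min_{\Delta r} \; D_f\!\left(d^{\pi^{*}_{\Delta r}} \,\middle\|\, d^{E}\right)
\quad \text{s.t.} \quad
\pi^{*}_{\Delta r} \in \arg\max_{\pi} \;
\mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[\tilde{r}(s,a) + \Delta r(s,a)\right]
- \alpha\, D_f\!\left(d^{\pi} \,\middle\|\, d^{D}\right)
\]

The divergence penalty toward $d^{D}$ is one common way to express pessimism in offline RL and is used here as an assumption; the exact objective and regularizer used by RGM may differ from this sketch.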