Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using proxy reward functions that only approximate the true goal. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy, and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition of the problem. To address this gap, we define reward hacking as the breakdown, under optimization, of the correlation between proxy and true rewards on states and actions visited by a "reference policy." We show that this definition captures reward hacking behavior across several realistic settings, including reinforcement learning from human feedback (RLHF). Using our formulation, we show theoretically that regularization to the reference policy can effectively prevent reward hacking. While current practice in RLHF applies a KL penalty between action distributions for this purpose, our theory suggests that regularizing the $\chi^2$ divergence between the policies' occupancy measures can be more effective. We intuitively show the benefits of this type of regularization and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF. Our code is available at https://github.com/cassidylaidlaw/orpo.
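The contrast the abstract draws between a KL penalty and a $\chi^2$ occupancy-measure penalty can be sketched numerically. The snippet below (a minimal illustration, not the paper's implementation; all state counts, occupancy values, and the `lam` coefficient are hypothetical) compares the two divergences between a policy's state occupancy measure $d_\pi$ and a reference policy's $d_{\mathrm{ref}}$, using the standard definitions $\mathrm{KL}(p\,\|\,q)=\sum_x p(x)\log\frac{p(x)}{q(x)}$ and $\chi^2(p\,\|\,q)=\sum_x \frac{(p(x)-q(x))^2}{q(x)}$:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def chi2_divergence(p, q):
    """Chi-squared divergence: sum_x (p(x) - q(x))^2 / q(x)."""
    return float(np.sum((p - q) ** 2 / q))

# Hypothetical occupancy measures over 4 states: the optimized policy
# concentrates on one high-proxy-reward state, while the reference
# policy visits states uniformly (values are illustrative only).
d_pi  = np.array([0.70, 0.10, 0.10, 0.10])
d_ref = np.array([0.25, 0.25, 0.25, 0.25])

proxy_return = 1.0  # hypothetical expected proxy reward under d_pi
lam = 0.1           # hypothetical regularization strength

# Regularized objectives: maximize proxy return minus a divergence penalty.
j_kl   = proxy_return - lam * kl_divergence(d_pi, d_ref)
j_chi2 = proxy_return - lam * chi2_divergence(d_pi, d_ref)

print(f"KL penalty:   {kl_divergence(d_pi, d_ref):.3f}")
print(f"chi2 penalty: {chi2_divergence(d_pi, d_ref):.3f}")
```

Because $\chi^2$ upper-bounds KL and grows quadratically in the occupancy gap, it penalizes large distribution shift away from the reference policy more aggressively, which is the intuition behind the abstract's claim that it better mitigates reward hacking.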