Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using flawed proxy rewards that seem to capture the true objective. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy, and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. To address this, we introduce a definition of reward hacking based on the correlation between proxy and true rewards over states and actions seen by a "base policy" — a correlation that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). We then show theoretically that regularization to the base policy can effectively prevent reward hacking. While current RLHF approaches apply a KL penalty between the action distributions of policies, our theory suggests that it is more effective to regularize using the $\chi^2$ divergence between the policies' occupancy measures. We intuitively show why this type of regularization is superior and demonstrate that it better mitigates reward hacking in practice across four realistic domains, including RLHF for LLMs. Our code is available at https://github.com/cassidylaidlaw/orpo.
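To make the distinction between the two regularizers concrete, here is a minimal sketch contrasting a KL penalty with a $\chi^2$ penalty on toy occupancy measures. The distributions `d_base` and `d_opt` are hypothetical stand-ins for the state-action visitation distributions of a base policy and an optimized policy; this is an illustration of the two divergences, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical occupancy measures (state-action visitation frequencies)
# for a base policy and a policy optimized against a proxy reward.
# The values are illustrative, not taken from any experiment.
d_base = np.array([0.4, 0.3, 0.2, 0.1])  # base policy occupancy
d_opt = np.array([0.1, 0.1, 0.2, 0.6])   # optimized policy occupancy

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return float(np.sum(p * np.log(p / q)))

def chi2_divergence(p, q):
    """chi^2(p || q) = sum_i (p_i - q_i)^2 / q_i."""
    return float(np.sum((p - q) ** 2 / q))

kl = kl_divergence(d_opt, d_base)
chi2 = chi2_divergence(d_opt, d_base)

# chi^2 upper-bounds KL (KL(p||q) <= log(1 + chi^2(p||q)) <= chi^2(p||q)),
# so the chi^2 penalty reacts more sharply when the optimized policy
# concentrates mass on state-actions the base policy rarely visits.
print(f"KL penalty:   {kl:.3f}")
print(f"chi^2 penalty: {chi2:.3f}")
```

Note that the χ² term grows quadratically in the density ratio wherever `d_opt` exceeds `d_base`, which is one intuition for why it penalizes drift into rarely visited state-actions more aggressively than KL.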