When applying reinforcement learning from human feedback (RLHF), the reward is learned from data and therefore always has some error. It is common to mitigate this by regularizing the policy with a KL divergence penalty toward a base model, in the hope that balancing reward against regularization will achieve desirable outcomes despite the reward misspecification. We show that when the reward error is light-tailed, optimal policies under progressively weaker KL penalties achieve arbitrarily high utility. However, if the error is heavy-tailed, some policies obtain arbitrarily high reward while achieving no more utility than the base model, a phenomenon we call catastrophic Goodhart. We adapt a discrete optimization method to measure the tails of reward models and find that they are consistent with light-tailed error. However, the pervasiveness of heavy-tailed distributions in many real-world applications suggests that future sources of RL reward could have heavy-tailed error, raising the likelihood of reward hacking even with KL regularization.
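To make the setup precise (notation ours, not taken from the abstract): write the learned reward as the sum of true utility and reward error, R(x) = U(x) + E(x), and let \pi_0 denote the base model. KL-regularized RLHF with penalty strength \beta > 0 solves

\[
\pi^{*} \in \arg\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[R(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{0}\big), \qquad R(x) = U(x) + E(x).
\]

The dichotomy above can then be read off this objective: if E is light-tailed, taking \beta \to 0 yields optimal policies with arbitrarily high \mathbb{E}_{\pi^{*}}[U]; if E is heavy-tailed, there exist policies \pi with \mathbb{E}_{\pi}[R] arbitrarily large but \mathbb{E}_{\pi}[U] no greater than \mathbb{E}_{\pi_0}[U], because the reward term can be driven entirely by rare, extreme values of E at negligible KL cost.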
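The tail-measurement idea can be illustrated with a hedged sketch. This is not the paper's method (the paper adapts a discrete optimization method); below we instead apply a standard Hill estimator to synthetic samples standing in for reward scores, to show how light- and heavy-tailed distributions separate empirically.

import numpy as np

def hill_estimator(samples: np.ndarray, k: int) -> float:
    """Hill estimate of the tail index alpha from the k largest samples.

    A small, stable alpha-hat (e.g. near 2) indicates a heavy, power-law
    tail; a large, k-sensitive estimate is consistent with a light tail.
    """
    order = np.sort(samples)[::-1]           # descending order statistics
    top, threshold = order[:k], order[k]     # k largest, and the (k+1)-th
    if threshold <= 0:
        raise ValueError("Hill estimator needs positive order statistics")
    gamma = np.mean(np.log(top) - np.log(threshold))  # mean log-excess
    return 1.0 / gamma

rng = np.random.default_rng(0)
light = rng.normal(size=100_000)              # stand-in for light-tailed reward error
heavy = rng.pareto(2.0, size=100_000) + 1.0   # Pareto sample, true tail index alpha = 2
k = 500
print(f"Gaussian alpha-hat: {hill_estimator(light, k):.1f}")  # large value => light tail
print(f"Pareto   alpha-hat: {hill_estimator(heavy, k):.2f}")  # near 2 => heavy tail

In practice the synthetic samples would be replaced by reward-model scores of sampled completions; the finding that measured reward-model tails are consistent with light-tailed error corresponds to large, threshold-sensitive alpha-hat values under estimators of this kind.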