Reinforcement Learning from Human Feedback (RLHF) and from Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy exploits inaccuracies of the reward and learns unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training towards flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
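To illustrate the kind of update rule the abstract refers to, the following is a minimal PyTorch sketch of explicit gradient regularization with a finite-difference estimate: the gradient of a gradient-norm penalty is approximated from two gradient evaluations, one at the current parameters and one at a point perturbed along the normalized gradient. The loss function, penalty weight `lam`, and perturbation size `eps` are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def gr_step(model, loss_fn, batch, optimizer, lam=0.1, eps=1e-2):
    """One optimizer step with explicit gradient regularization (GR),
    using a finite-difference estimate of the gradient-norm penalty's gradient.
    Sketch only: loss_fn(model, batch) is assumed to return a scalar RL loss."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: gradient g of the plain loss at the current parameters.
    loss = loss_fn(model, batch)
    g = torch.autograd.grad(loss, params)
    g_norm = torch.sqrt(sum((gi ** 2).sum() for gi in g)) + 1e-12

    # Perturb parameters by eps along the normalized gradient direction.
    with torch.no_grad():
        for p, gi in zip(params, g):
            p.add_(gi, alpha=eps / g_norm)

    # Second pass: gradient at the perturbed point.
    loss_p = loss_fn(model, batch)
    g_p = torch.autograd.grad(loss_p, params)

    # Restore the original parameters.
    with torch.no_grad():
        for p, gi in zip(params, g):
            p.sub_(gi, alpha=eps / g_norm)

    # Finite-difference estimate:
    # grad of [loss + lam * ||grad loss||]  ~=  g + (lam / eps) * (g_p - g).
    optimizer.zero_grad()
    for p, gi, gpi in zip(params, g, g_p):
        p.grad = gi + (lam / eps) * (gpi - gi)
    optimizer.step()
    return loss.detach()
```

The two-pass structure costs one extra forward-backward per update but avoids explicit second-order derivatives, which is what makes the finite-difference estimate efficient.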