Reward Shaping to Mitigate Reward Hacking in RLHF

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that uses the latent preferences embedded within the reward model as the signal for reinforcement learning. Moreover, PAR exhibits two critical variance-reduction properties that help stabilize the RLHF training process and extend the tolerance window for early stopping. We evaluate PAR on the base model Gemma2-2B using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results show that PAR outperforms other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR is data-efficient, requiring only a single reference reward for optimal performance, and remains robust against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR.
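
As a minimal sketch of how the two design principles could be realized together, assume the reward model follows a Bradley-Terry formulation, so that the preference of a response over a reference response is the sigmoid of their reward difference. The abstract does not state this formula; the sigmoid form, the function names `sigmoid` and `preference_as_reward`, and the averaging over several reference rewards are all illustrative assumptions, and the linked repository should be treated as the authoritative implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_as_reward(reward: float, reference_rewards: list[float]) -> float:
    """Hypothetical shaping: mean preference of a response over references.

    ASSUMPTION: the preference is sigmoid(reward - reference_reward), as in
    a Bradley-Terry reward model. This is not confirmed by the abstract.
    """
    if not reference_rewards:
        raise ValueError("at least one reference reward is required")
    total = sum(sigmoid(reward - ref) for ref in reference_rewards)
    return total / len(reference_rewards)


if __name__ == "__main__":
    # A single reference reward, matching the data-efficiency claim.
    reference = [0.0]
    for raw in [-4.0, 0.0, 1.0, 2.0, 4.0, 8.0]:
        shaped = preference_as_reward(raw, reference)
        print(f"raw reward {raw:+.1f} -> shaped reward {shaped:.4f}")
```

Under these assumptions the shaped reward stays within [0, 1], which satisfies principle (1). Its slope is steepest when the raw reward is close to the reference reward and flattens as the raw reward grows, which matches principle (2): a policy that starts near the reference sees rapid initial gains, while further inflation of the proxy reward yields diminishing returns.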
