Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking (also known as reward over-optimization) remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, as measured by their Mahalanobis distance from the SFT-induced distribution. Motivated by this observation, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective in the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying the severity of reward hacking, enabling principled hyperparameter tuning and online mitigation strategies such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
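To make the outlier-detection idea concrete, the following is a minimal sketch of how one might flag reward-hacked responses via Mahalanobis distance in an IB latent space. It assumes the SFT-induced latent distribution is approximated by a single Gaussian fitted to latent codes of SFT-model responses, and it approximates a MOP-style score as the fraction of policy samples falling outside a chi-squared confidence envelope; the function names, the Gaussian assumption, and the `alpha` threshold are illustrative choices, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import chi2


def fit_sft_latent_distribution(latents_sft: np.ndarray):
    """Fit a Gaussian to IB latent codes of SFT-model responses (shape: [n, d])."""
    mu = latents_sft.mean(axis=0)
    cov = np.cov(latents_sft, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    return mu, cov_inv


def squared_mahalanobis(latents: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each latent code from the SFT-induced Gaussian."""
    diff = latents - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def mahalanobis_outlier_probability(
    latents_rl: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray, alpha: float = 0.01
) -> float:
    """MOP-style score (illustrative): fraction of RL-policy responses whose latent
    lies outside the (1 - alpha) chi-squared envelope of the SFT distribution."""
    d2 = squared_mahalanobis(latents_rl, mu, cov_inv)
    threshold = chi2.ppf(1.0 - alpha, df=latents_rl.shape[1])
    return float((d2 > threshold).mean())


if __name__ == "__main__":
    # Synthetic demo: SFT latents near the origin, RL latents partly shifted (hacked).
    rng = np.random.default_rng(0)
    sft = rng.normal(0.0, 1.0, size=(2000, 16))
    rl = np.vstack([rng.normal(0.0, 1.0, size=(800, 16)),
                    rng.normal(4.0, 1.0, size=(200, 16))])  # 20% outliers
    mu, cov_inv = fit_sft_latent_distribution(sft)
    print(f"MOP-style score: {mahalanobis_outlier_probability(rl, mu, cov_inv):.3f}")
```

In practice such a score could be tracked over RLHF training steps, with a rising value signaling reward hacking and triggering mitigation such as early stopping.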