Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. This issue primarily arises from reward misgeneralization, where reward models (RMs) compute rewards from spurious features that are irrelevant to human preferences. In this work, we tackle this problem from an information-theoretic perspective and propose InfoRM, a reward modeling framework that introduces a variational information bottleneck (IB) objective to filter out preference-irrelevant information. Notably, we further identify a correlation between overoptimization and outliers in InfoRM's IB latent space, establishing that latent space as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Cluster Separation Index (CSI), which quantifies deviations in the IB latent space, as an indicator of reward overoptimization and a basis for online mitigation strategies. Extensive experiments across a wide range of settings and RM scales (70M, 440M, 1.4B, and 7B) demonstrate the effectiveness of InfoRM. Further analyses reveal that InfoRM's overoptimization detection mechanism is not only effective but also robust across a broad range of datasets, marking a notable advancement in the field of RLHF. The code will be released upon acceptance.
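For concreteness, the following is a minimal sketch of a variational IB objective for reward modeling of the kind the abstract describes; the encoder $q_\phi$, reward head $r_\theta$, standard-normal prior $p(z)$, and trade-off weight $\beta$ are generic notation introduced here for illustration, not necessarily the paper's exact formulation. The IB principle seeks a latent representation $Z$ that preserves preference-relevant information while compressing away the rest, $\max_q \, I(Z; Y) - \beta\, I(Z; X)$, which admits a tractable variational bound combining the standard Bradley-Terry preference loss with a KL regularizer toward the prior:
$$
\mathcal{L}(\theta, \phi) \;=\; \mathbb{E}_{(x,\, y_w,\, y_l)}\Big[-\log \sigma\big(r_\theta(z_w) - r_\theta(z_l)\big)\Big] \;+\; \beta\, \mathbb{E}\Big[\mathrm{KL}\big(q_\phi(z \mid x, y) \,\big\|\, p(z)\big)\Big],
$$
where $z_w \sim q_\phi(\cdot \mid x, y_w)$ and $z_l \sim q_\phi(\cdot \mid x, y_l)$ are latents for the chosen and rejected responses. Under this reading, larger $\beta$ enforces stronger compression of the latent space, and outliers in that space, as quantified by CSI, flag responses whose rewards are driven by information the bottleneck was trained to discard.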