Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. It stems primarily from limitations in reward modeling, namely the limited generalizability of the reward model and inconsistencies in the preference dataset. In this work, we tackle this problem from an information-theoretic perspective and propose InfoRM, a generalizable and robust framework for reward modeling that introduces a variational information bottleneck objective to filter out preference-irrelevant information and provides a mechanism for modulating model complexity. Notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing InfoRM as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space and serves as an indicator of reward overoptimization, facilitating the development of online mitigation strategies. Extensive experiments across a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analyses confirm that InfoRM's overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of RLHF. Code will be released upon acceptance.
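For context, a minimal sketch of the standard variational information bottleneck lower bound (Alemi et al., 2017) on which an objective like InfoRM's presumably builds; the paper's exact formulation may differ. Here $x$ denotes a prompt-response pair, $y$ the preference signal, and $z$ the latent representation produced by an encoder $p_\phi(z \mid x)$:
\[
I(Y;Z) - \beta\, I(X;Z)
\;\ge\;
\mathbb{E}_{p(x,y)}\,\mathbb{E}_{p_\phi(z \mid x)}\!\big[\log q_\psi(y \mid z)\big]
\;-\; \beta\, \mathbb{E}_{p(x)}\!\big[\mathrm{KL}\big(p_\phi(z \mid x)\,\big\|\, r(z)\big)\big] \;+\; \mathrm{const},
\]
where $q_\psi(y \mid z)$ is a variational decoder, $r(z)$ is a fixed prior (typically $\mathcal{N}(0, I)$), and $\beta$ trades prediction of the preference signal against compression of $x$; the compression term is what filters out preference-irrelevant information.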
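The abstract does not define ICDS, so as a purely hypothetical illustration (not the paper's definition) of how a latent-space deviation score of this kind could be instantiated: fit clusters $\{(\mu_k, \Sigma_k)\}$ to the bottleneck latents of responses from the pre-RLHF (SFT) model, then score each latent $z_i$ of the RLHF-tuned model by its Mahalanobis distance to the nearest cluster and aggregate over $n$ samples:
\[
d_i \;=\; \min_k \sqrt{(z_i - \mu_k)^\top \Sigma_k^{-1} (z_i - \mu_k)},
\qquad
\mathrm{score} \;=\; \frac{1}{n}\sum_{i=1}^{n} d_i.
\]
Under this sketch, a rising aggregate score would indicate that the policy's outputs are drifting into outlier regions of the reward model's latent space, consistent with the correlation between overoptimization and latent-space outliers reported above.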