Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. It stems primarily from two limitations in reward modeling: the limited generalizability of the reward model and inconsistency in the preference dataset. In this work, we tackle this problem from an information-theoretic perspective and propose InfoRM, a generalizable and robust framework for reward modeling that introduces a variational information bottleneck objective to filter out irrelevant information and a mechanism for modulating model complexity. Notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing InfoRM as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments across a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analyses show that InfoRM's overoptimization detection mechanism is effective, which may mark a notable advance for RLHF. Code will be released upon acceptance.
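To make the variational information bottleneck idea in the abstract concrete, the following is a minimal sketch of a generic VIB-style preference loss for a single (chosen, rejected) pair. It is not the authors' implementation: the Gaussian encoder outputs, the linear reward head `head`, the standard-normal prior, the Bradley-Terry preference term, and the trade-off weight `beta` are all illustrative assumptions, and every function name is hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def sample_latent(mu, log_var, rng):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def vib_preference_loss(chosen, rejected, head, beta, rng):
    """Preference loss plus a beta-weighted compression penalty (assumed form).

    chosen / rejected: (mu, log_var) pairs, standing in for encoder outputs.
    head: weights of a hypothetical linear reward head applied to the latent.
    """
    z_c = sample_latent(chosen[0], chosen[1], rng)
    z_r = sample_latent(rejected[0], rejected[1], rng)
    reward_c = sum(w * v for w, v in zip(head, z_c))
    reward_r = sum(w * v for w, v in zip(head, z_r))
    # Bradley-Terry negative log-likelihood: -log sigmoid(reward_c - reward_r)
    pref_nll = softplus(-(reward_c - reward_r))
    # The KL term penalises information the latent keeps about the input
    kl = kl_to_standard_normal(*chosen) + kl_to_standard_normal(*rejected)
    return pref_nll + beta * kl


if __name__ == "__main__":
    rng = random.Random(0)
    chosen = ([1.0, 0.5], [0.0, -1.0])
    rejected = ([-1.0, 0.0], [0.0, 0.0])
    print(vib_preference_loss(chosen, rejected, head=[1.0, 1.0], beta=0.1, rng=rng))
```

In this sketch, a larger `beta` pushes the latent toward the prior and so discards more input information, while `beta = 0` reduces the objective to a plain preference loss. The latent dimension and `beta` are the two knobs that a complexity-modulation mechanism of this kind would plausibly adjust.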