Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human values. However, RM training data is widely recognized as low-quality and contains inductive biases that can easily lead to overfitting and reward hacking. For example, human annotators usually prefer more detailed and comprehensive responses, which also tend to be longer, so response length becomes an almost unavoidable inductive bias. The few existing RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, \textit{e.g.}, Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called \textbf{D}ebiasing via \textbf{I}nformation optimization for \textbf{R}M (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs while minimizing the MI between RM outputs and biased attributes of preference inputs. Grounded in information theory, DIR can handle more sophisticated biases with non-linear correlations, broadening the real-world applicability of RM debiasing methods. In experiments, we verify the effectiveness of DIR on three types of inductive bias: \textit{response length}, \textit{sycophancy}, and \textit{format}. We find that DIR not only effectively mitigates the target inductive biases but also improves RLHF performance across diverse benchmarks and generalizes better. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.
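The contrast between linear correlation and mutual information can be seen in a small toy example. The sketch below is not the DIR implementation and uses no data from our experiments: the variables `attribute` and `reward`, the quadratic dependence, and the histogram-based plug-in MI estimator are all illustrative assumptions. It only shows that a Pearson coefficient can be close to zero for a reward that depends strongly, but non-linearly, on a bias attribute, whereas an MI estimate still registers the dependence.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def histogram_mi(x, y, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.

    This simple estimator is an assumption made for illustration only;
    it is not the estimator used to train a reward model.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint distribution over bins
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    mask = p_xy > 0                            # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))


rng = np.random.default_rng(0)

# Hypothetical bias attribute, e.g. a centered and rescaled response length.
attribute = rng.uniform(-1.0, 1.0, size=20_000)

# Toy reward with a purely non-linear (quadratic) dependence on the attribute.
reward = attribute ** 2 + 0.05 * rng.normal(size=attribute.size)

# Control: a score that is independent of the attribute.
independent = rng.normal(size=attribute.size)

print("Pearson(attribute, reward):     ", pearson(attribute, reward))
print("MI(attribute, reward) [nats]:   ", histogram_mi(attribute, reward))
print("MI(attribute, independent):     ", histogram_mi(attribute, independent))
```

In this toy setting, a debiasing penalty built on the Pearson coefficient would see almost nothing to penalize, while an MI-based penalty would still have a signal to minimize.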
