In this work, we study reward hacking on response length, a challenge that emerges in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose, but less helpful response can often deceive LLM evaluators, or even human evaluators, into assigning it a high score. The same issue holds for some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, which inspects the trade-off between LLM evaluation score and response length obtained by varying training hyperparameters. Based on this evaluation, we conduct large-scale studies, whose results offer insights into the efficacy of hyperparameters and tricks used in RL for mitigating length bias. We further propose to improve the reward model by jointly training two linear heads on shared feature representations to predict the rewards: one trained to correlate with length, and the other trained to be decorrelated from length and therefore to focus more on the actual content. We then discard the length head in RL to prevent reward hacking on length. Experiments demonstrate that our approach almost eliminates the reward's correlation with length and improves the obtained policy by a significant margin.
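The two-head idea can be illustrated with a minimal sketch. It is not the exact training objective used in this work: the pairwise ranking loss on the sum of both heads, the Pearson-correlation penalties, the weight `lam`, and all function names are assumptions made for illustration, and the shared features are given as plain lists rather than produced by a language model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def head_scores(weights, features):
    """Apply one linear head to a batch of shared feature vectors."""
    return [sum(w * f for w, f in zip(weights, feat)) for feat in features]


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def two_head_loss(feats_chosen, feats_rejected, len_chosen, len_rejected,
                  w_quality, w_length, lam=1.0):
    """Hypothetical joint loss for a quality head and a length head.

    Assumed form: a pairwise ranking loss on the summed reward, plus a term
    that pushes the length head to correlate with length and the quality
    head to be decorrelated from it.
    """
    q_c, q_r = head_scores(w_quality, feats_chosen), head_scores(w_quality, feats_rejected)
    l_c, l_r = head_scores(w_length, feats_chosen), head_scores(w_length, feats_rejected)

    # Ranking loss: the chosen response should get the higher summed reward.
    margins = [(qc + lc) - (qr + lr) for qc, lc, qr, lr in zip(q_c, l_c, q_r, l_r)]
    rank = sum(softplus(-m) for m in margins) / len(margins)

    lengths = list(len_chosen) + list(len_rejected)
    corr_length_head = pearson(l_c + l_r, lengths)
    corr_quality_head = pearson(q_c + q_r, lengths)
    return rank + lam * (abs(corr_quality_head) - corr_length_head)


def rl_reward(w_quality, features):
    """Reward used during RL: the length head is discarded."""
    return head_scores(w_quality, features)
```

In this sketch, only `rl_reward` would be called during policy optimization, so any signal absorbed by the length head never reaches the policy.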