Great success has been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models, with open preference datasets enabling wider experimentation, particularly for "helpfulness" in tasks like dialogue and web question answering. Alongside these improvements, however, RLHF also often drives models to produce longer outputs. This paper demonstrates, on three diverse settings, that optimizing for response length is, much more than previously thought, a significant factor behind RLHF's reported improvements. Studying the strategies RL optimization uses to maximize reward, we find that improvements in reward are largely driven by increasing response length rather than other features. Indeed, we find that even a purely length-based reward reproduces most downstream RLHF improvements over supervised fine-tuned models. Testing a comprehensive set of length-countering interventions, we identify reward models as the dominant source of these biases; studying their training dynamics, we find that they are non-robust and easily influenced by length biases in preference data.
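To make the length-only baseline concrete, below is a minimal sketch, not the paper's implementation, of a reward that scores a response purely by its length. The word-level counting and the cap `max_len` are illustrative assumptions; such a function would simply stand in for the learned reward model when scoring rollouts in a standard PPO-style RLHF loop.

```python
def length_reward(response: str, max_len: int = 256) -> float:
    """Toy reward that depends only on response length.

    Counts whitespace-separated words and caps the score at max_len,
    so outputs longer than the cap stop gaining additional reward.
    """
    n_words = len(response.split())
    return min(n_words, max_len) / max_len


# Example: a longer response receives a higher reward, up to the cap.
print(length_reward("Short answer."))                  # ~0.008
print(length_reward("A much longer answer. " * 50))    # 0.78...
```

The cap is one possible design choice to keep a purely length-based objective bounded; without it, the policy would be rewarded for growing outputs without limit.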