Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns language models closely with human-centric values. The initial phase of RLHF involves learning human values using a reward model from ranking data. It is observed that the performance of the reward model degrades after one epoch of training, and that optimizing too much against the learned reward model eventually hinders the true objective. This paper delves into these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS). The core idea is that during each training epoch, we not only update the model with the data, but also update the data using the model, replacing hard labels with soft labels. Our empirical findings highlight the superior performance of this approach over traditional methods.
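The two-step loop described in the abstract (update the model with the data, then update the data with the model) can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything concrete here is an assumption made for illustration: a tabular Bradley-Terry reward model with one scalar reward per item, plain SGD on cross-entropy, and a label update that moves each label toward the model's predicted preference probability at a hypothetical smoothing rate `beta`. The function name `iterative_data_smoothing` and all parameters are likewise hypothetical, not the paper's implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def iterative_data_smoothing(pairs, labels, num_items,
                             epochs=50, lr=0.5, beta=0.3):
    """Sketch of the IDS idea on a tabular Bradley-Terry reward model.

    pairs:  list of (i, j) item indices that were compared
    labels: 1.0 if item i was preferred, 0.0 if item j was preferred
    beta:   assumed smoothing rate for the label update (hypothetical)
    Returns (rewards, soft_labels).
    """
    rewards = [0.0] * num_items
    soft = [float(y) for y in labels]  # start from the hard labels

    for _ in range(epochs):
        # Step 1: update the model with the data (SGD on cross-entropy
        # between the current labels and the predicted preference).
        for (i, j), y in zip(pairs, soft):
            p = sigmoid(rewards[i] - rewards[j])
            grad = p - y  # gradient w.r.t. rewards[i]; negated for rewards[j]
            rewards[i] -= lr * grad
            rewards[j] += lr * grad

        # Step 2: update the data with the model (assumed rule: move each
        # label toward the model's prediction, turning hard into soft labels).
        for k, (i, j) in enumerate(pairs):
            p = sigmoid(rewards[i] - rewards[j])
            soft[k] = (1.0 - beta) * soft[k] + beta * p

    return rewards, soft


if __name__ == "__main__":
    # Item 0 beats item 1 in three of four comparisons.
    demo_pairs = [(0, 1)] * 4
    demo_labels = [1.0, 1.0, 1.0, 0.0]
    r, s = iterative_data_smoothing(demo_pairs, demo_labels, num_items=2)
    print("rewards:", r)
    print("soft labels:", s)
```

In this sketch, setting `beta=0` leaves the labels untouched and reduces the loop to ordinary multi-epoch training on hard labels, which makes it easy to compare the two regimes on the same data.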