The success of reinforcement learning from human feedback (RLHF) in language model alignment depends strongly on the quality of the underlying reward model. In this paper, we present a novel approach to improving reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. The result is a self-training strategy that generates preference pairs by selecting the best and worst candidates from a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment by offering synthetic preference generation as a solution to reward modeling challenges.
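The best-and-worst selection step described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `make_preference_pair`, `generate`, `score`, and the pool size `n` are hypothetical, the policy and the scoring model are passed in as plain callables, and skipping pools whose scores all tie is a design choice made here rather than something the abstract specifies.

```python
from typing import Callable, Optional, Tuple


def make_preference_pair(
    query: str,
    generate: Callable[[str], str],      # hypothetical: samples one response from the policy
    score: Callable[[str, str], float],  # hypothetical: scores a (query, response) pair
    n: int = 8,                          # hypothetical pool size N
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for one query, or None if all scores tie."""
    if n < 2:
        raise ValueError("need at least two candidates to form a pair")

    # Sample a pool of N candidate responses for the same query.
    candidates = [generate(query) for _ in range(n)]

    # Score every candidate with the current scoring model.
    scores = [score(query, c) for c in candidates]

    # Pick the highest- and lowest-scoring candidates.
    best = max(range(n), key=scores.__getitem__)
    worst = min(range(n), key=scores.__getitem__)

    # A pool with identical scores carries no preference signal (assumption: skip it).
    if scores[best] == scores[worst]:
        return None

    return candidates[best], candidates[worst]
```

Each returned pair could then be added to the reward model's training set as a synthetic (chosen, rejected) example alongside the human-labeled pairs.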