The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
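To make the selection step concrete, here is a minimal sketch of generating one synthetic preference pair under this strategy. The names `sample_response`, `score`, and the pool size `n` are hypothetical stand-ins for the policy's sampler and the base reward model's scoring function, not an API from the paper.

```python
from typing import Callable, Dict, List

def synthetic_preference_pair(
    query: str,
    sample_response: Callable[[str], str],  # assumed: draws one on-policy response
    score: Callable[[str, str], float],     # assumed: base reward model's score
    n: int = 16,                            # illustrative pool size
) -> Dict[str, str]:
    """Sample a pool of N responses to the query, then label the
    highest- and lowest-scoring candidates as a preference pair."""
    candidates: List[str] = [sample_response(query) for _ in range(n)]
    scores = [score(query, c) for c in candidates]
    best = candidates[scores.index(max(scores))]
    worst = candidates[scores.index(min(scores))]
    return {"prompt": query, "chosen": best, "rejected": worst}
```

Pairs produced this way can then be mixed into the reward model's training set alongside human preference data, which is the self-training loop the abstract describes.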