Reinforcement learning from human feedback (RLHF) plays a crucial role in aligning language models with human preferences. While the significance of dataset quality is generally recognized, explicit investigations into its impact within the RLHF framework have, to our knowledge, been limited. This paper addresses the issue of text quality within the preference dataset by focusing on direct preference optimization (DPO), an increasingly adopted reward-model-free RLHF method. We confirm that text quality affects the performance of models optimized with DPO more strongly than that of models optimized with reward-model-based RLHF. Building on this insight, we propose an extension of DPO, termed filtered direct preference optimization (fDPO). fDPO uses a trained reward model to monitor the quality of texts within the preference dataset during DPO training. Samples of lower quality are discarded based on comparisons with texts generated by the model being optimized, resulting in a more accurate dataset. Experimental results demonstrate that fDPO enhances the final model performance. Our code is available at https://github.com/CyberAgentAILab/filtered-dpo.
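The filtering step described above can be illustrated with a toy sketch. This is not the paper's implementation: the reward model, the generation routine, and the dataset format here are all stand-in assumptions chosen only to make the control flow concrete. The idea shown is that before training on a preference pair, the chosen response is scored against a fresh sample from the current policy, and pairs whose chosen response scores lower are discarded.

```python
import random

def reward_model(text):
    # Stand-in scorer: longer responses score higher.
    # Purely illustrative; a real reward model would be a trained network.
    return len(text)

def policy_generate(prompt):
    # Stand-in for sampling a response from the model being optimized.
    return prompt + " " + "token " * random.randint(1, 5)

def filter_preference_dataset(dataset):
    """Keep only pairs whose chosen response is at least as good,
    under the reward model, as a sample from the current policy."""
    kept = []
    for prompt, chosen, rejected in dataset:
        policy_sample = policy_generate(prompt)
        if reward_model(chosen) >= reward_model(policy_sample):
            kept.append((prompt, chosen, rejected))
    return kept

# Example: the second pair's chosen response is short (low reward here),
# so it is filtered out; DPO would then train only on the surviving pairs.
data = [
    ("q1", "a long chosen answer with many informative tokens", "bad answer"),
    ("q2", "x", "y"),
]
kept = filter_preference_dataset(data)
```

In the actual method, this comparison happens during DPO training (using the evolving policy), so the dataset becomes progressively cleaner as the model improves.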