While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
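The "simple classification loss" the abstract refers to can be sketched concretely. Below is a minimal, illustrative implementation of the DPO objective for a single preference pair, assuming the summed token log-probabilities of the chosen and rejected responses have already been computed under both the policy and the frozen reference model; the function and argument names are hypothetical, not from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sketch of the DPO loss for one preference pair.

    logp_w / logp_l: log-probability of the preferred (w) and
    dispreferred (l) response under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta scales the implicit KL constraint that
    keeps the policy from drifting far from the reference.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Binary logistic (Bradley-Terry) loss on the reward margin:
    # -log(sigmoid(margin)) = log(1 + exp(-margin)).
    margin = reward_w - reward_l
    return math.log1p(math.exp(-margin))
```

Minimizing this loss pushes the policy to assign relatively more probability to preferred responses than the reference model does, with no sampling from the LM and no separately trained reward model.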