While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.
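The "simple classification loss" the abstract refers to can be sketched concretely. Below is a minimal, illustrative implementation of the DPO objective for a single preference pair, assuming the summed token log-probabilities of the chosen and rejected responses have already been computed under both the policy and the frozen reference model; the function and argument names are hypothetical, not from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sketch of the DPO loss for one preference pair.

    logp_w / logp_l: log-probability of the preferred (w) and
    dispreferred (l) response under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta scales the implicit KL constraint that
    keeps the policy from drifting far from the reference.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    # Binary logistic (Bradley-Terry) loss on the reward margin:
    # -log(sigmoid(margin)) = log(1 + exp(-margin)).
    margin = reward_w - reward_l
    return math.log1p(math.exp(-margin))
```

Minimizing this loss pushes the policy to assign relatively more probability to preferred responses than the reference model does, with no sampling from the LM and no separately trained reward model.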