Learn Your Reference Model for Real Good Alignment

The complexity of the alignment problem stems from the fact that existing methods are unstable, and researchers continuously invent tricks to address this shortcoming. For instance, in Reinforcement Learning From Human Feedback (RLHF), the fundamental technique of language model alignment, the objective not only maximizes reward but also minimizes the Kullback-Leibler divergence between the trainable policy and the supervised fine-tuned (SFT) policy. This additional term prevents the model from overfitting to the Reward Model (RM) and from generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement that the policy stay close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With this straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, as measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several criteria at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.
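To make the idea of updating the reference policy concrete, here is a minimal sketch in plain Python. The abstract does not state how the reference is updated, so the two rules shown (a soft interpolation with weight `alpha`, and a hard copy) are assumptions chosen for illustration, as are all function names, the toy parameter lists, and the default `beta`. The loss is the standard DPO loss for a single preference pair, written in terms of sequence log-probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are log-probabilities of the chosen and rejected responses
    under the trainable policy and under the reference policy.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference does, scaled by beta.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def soft_update(ref_params, policy_params, alpha):
    """Assumed rule: move the reference a fraction alpha toward the policy."""
    return [alpha * p + (1.0 - alpha) * r
            for r, p in zip(ref_params, policy_params)]


def hard_update(policy_params):
    """Assumed rule: replace the reference with a copy of the policy."""
    return list(policy_params)


# Toy usage: parameters are plain lists of floats standing in for weights.
ref = [0.0, 2.0]
policy = [1.0, 4.0]
ref = soft_update(ref, policy, alpha=0.5)   # reference drifts toward the policy
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)     # policy already prefers the chosen text
```

With a frozen reference (plain DPO) the log-probabilities `ref_chosen_logp` and `ref_rejected_logp` always come from the SFT policy; in this sketch they would instead come from whatever the reference has been updated to, which is the only difference the example is meant to show.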

