Learn Your Reference Model for Real Good Alignment

The complexity of the alignment problem stems from the fact that existing methods are considered unstable. Reinforcement Learning from Human Feedback (RLHF) addresses this issue by minimizing the KL divergence between the trained policy and the initial supervised fine-tuned (SFT) policy, which avoids generating out-of-domain samples for the reward model (RM). Recently, many methods have emerged that shift from online to offline optimization, reformulating the RLHF objective and removing the reward model (DPO, IPO, KTO). Although they eliminate the reward model and the challenges it posed, these algorithms still constrain the trained policy to stay close to the SFT policy. In our paper, we argue that this implicit limitation in offline optimization methods leads to suboptimal results. To address this issue, we propose a new class of methods called Trust Region (TR-DPO, TR-IPO, TR-KTO), which update the reference policy during training. With this straightforward update approach, we demonstrate the effectiveness of the new paradigm of language model alignment against the classical one on the Anthropic-HH and Reddit TL;DR datasets. Most notably, in automatic side-by-side comparisons of TR methods and baselines using pretrained Pythia 6.9B models on the Reddit TL;DR task, the difference in win rates reaches 8.4% for DPO, 14.3% for IPO, and 15% for KTO. Finally, by rating model responses on criteria such as coherence, correctness, helpfulness, and harmlessness, we demonstrate that our proposed methods significantly outperform existing techniques.
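
To make the idea of "updating the reference policy during training" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the update rule, so the two rules shown (a soft interpolation with weight `alpha` and a hard copy every `tau` steps) are assumptions chosen for illustration, as are all function and parameter names. The loss is the standard DPO objective, written for a single preference pair whose sequence log-probabilities are already computed.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def soft_update(ref_params, policy_params, alpha):
    """Assumed rule: move the reference a fraction alpha toward the policy."""
    return [alpha * p + (1.0 - alpha) * r
            for r, p in zip(ref_params, policy_params)]


def hard_update(ref_params, policy_params, step, tau):
    """Assumed rule: replace the reference with the policy every tau steps."""
    if step % tau == 0:
        return list(policy_params)
    return list(ref_params)


# With identical policy and reference, the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Toy parameter vectors standing in for model weights.
new_ref_soft = soft_update([0.0, 0.0], [1.0, 2.0], alpha=0.5)
new_ref_hard = hard_update([0.0, 0.0], [1.0, 2.0], step=4, tau=2)
```

In classical DPO the reference stays frozen at the SFT policy for the whole run. Under either assumed rule above, the reference instead follows the trained policy, so the implicit closeness constraint is measured against a moving anchor rather than against the SFT policy.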

