The complexity of the alignment problem stems from the fact that existing methods are considered unstable. Reinforcement Learning from Human Feedback (RLHF) addresses this issue by minimizing the KL divergence between the trained policy and the initial supervised fine-tuned (SFT) policy, which avoids generating out-of-domain samples for the reward model (RM). Recently, many methods have emerged that shift from online to offline optimization, reformulating the RLHF objective and removing the reward model (DPO, IPO, KTO). Although they eliminate the reward model and the challenges it poses, these algorithms still constrain how far the trained policy can move from the SFT policy. In this paper, we argue that this implicit limitation of offline optimization methods leads to suboptimal results. To address this issue, we propose a class of new methods called Trust Region methods (TR-DPO, TR-IPO, TR-KTO), which update the reference policy during training. With this straightforward update approach, we demonstrate the effectiveness of the new paradigm of language model alignment against the classical one on the Anthropic-HH and Reddit TL;DR datasets. Most notably, when automatically comparing TR methods and baselines side by side using pretrained Pythia 6.9B models on the Reddit TL;DR task, the difference in win rates reaches 8.4% for DPO, 14.3% for IPO, and 15% for KTO. Finally, by assessing model response ratings based on criteria such as coherence, correctness, helpfulness, and harmlessness, we demonstrate that our proposed methods significantly outperform existing techniques.
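The idea of updating the reference policy during training can be illustrated with a minimal sketch. The abstract states only that the reference policy is updated; the two update rules below (a soft interpolation with weight `alpha` and a hard copy every `tau` steps) and all function and parameter names are assumptions made for illustration, not a description of the authors' implementation. The loss is the standard DPO objective, and parameters are plain lists of floats so the example stays self-contained.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trained policy or the reference policy.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -log_sigmoid(beta * margin)


def soft_update(ref_params: list[float], policy_params: list[float],
                alpha: float) -> list[float]:
    """Assumed soft update: move the reference a fraction alpha toward the policy."""
    return [alpha * p + (1.0 - alpha) * r
            for p, r in zip(policy_params, ref_params)]


def hard_update(ref_params: list[float], policy_params: list[float],
                step: int, tau: int) -> list[float]:
    """Assumed hard update: copy the policy into the reference every tau steps."""
    if step > 0 and step % tau == 0:
        return list(policy_params)
    return ref_params
```

In a training loop, one of the two update functions would be called after each optimizer step, so the log-probabilities passed as `ref_logp_chosen` and `ref_logp_rejected` come from a reference that follows the policy instead of staying fixed at the SFT checkpoint. Setting `alpha = 0` (or never reaching a multiple of `tau`) recovers the fixed-reference baseline.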