Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method for aligning large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods, which first learn a reward model and then apply actor-critic algorithms such as Proximal Policy Optimization (PPO). In academic benchmarks, however, state-of-the-art results are often achieved with reward-free methods such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies of the algorithmic properties of DPO and show that DPO may have fundamental limitations. We then comprehensively examine PPO and reveal the key factors behind its best performance in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experimental results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.
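To make the reward-based versus reward-free distinction concrete, the following is a minimal sketch of the two preference losses as they are commonly written in the literature; it is not taken from this abstract. The function names, the scalar (per-example) interface, and the default `beta = 0.1` are illustrative assumptions. The reward-based route fits an explicit reward model with a Bradley-Terry loss and then runs PPO against it (the PPO stage is not shown), while DPO applies the same loss form directly to the policy through an implicit reward.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for an explicit reward model (reward-based route).

    r_chosen / r_rejected are the scalar rewards the model assigns to the
    preferred and dispreferred responses for the same prompt.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not from the source
) -> float:
    """DPO loss (reward-free route).

    Same Bradley-Terry form, but the reward is implicit:
    beta * (log pi(y|x) - log pi_ref(y|x)), so no separate reward
    model is trained.
    """
    implicit_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_rejected = beta * (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(implicit_chosen - implicit_rejected)
```

When the policy equals the reference model, both implicit rewards are zero and the DPO loss equals log 2; raising the policy's log-probability of the preferred response relative to the reference lowers the loss.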