Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also achieves a remarkable length-controlled win rate of 48.6% against GPT-4-turbo when built on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
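The reweighting idea can be illustrated with a minimal sketch. The abstract does not specify the exact weighting formula, so the details below are assumptions: we take a standard DPO loss on a preference pair and scale it by a length-normalized probability of the pair under the current policy, so pairs the policy itself is likely to generate contribute more to the gradient (approximating on-policy learning). Function names and the normalization choice are illustrative, not the paper's definitions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * reward margin).

    logp_* are total log-probs of the chosen (w) and rejected (l)
    responses under the current policy; ref_logp_* under the
    frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pair_weight(logp_w, logp_l, len_w, len_l):
    """Hypothetical WPO-style weight: length-normalized probability
    of both responses under the current policy. Lies in (0, 1]
    and is larger for pairs the policy would plausibly sample."""
    return math.exp(logp_w / len_w) * math.exp(logp_l / len_l)

def wpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             len_w, len_l, beta=0.1):
    # Down-weight off-policy pairs that are unlikely under the
    # current policy, simulating on-policy preference learning.
    w = pair_weight(logp_w, logp_l, len_w, len_l)
    return w * dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

# Toy numbers: a likely chosen response and a less likely rejected one.
loss = wpo_loss(logp_w=-10.0, logp_l=-20.0,
                ref_logp_w=-12.0, ref_logp_l=-18.0,
                len_w=10, len_l=10)
```

In an actual training loop the log-probabilities would come from summing token-level log-softmax scores of the policy and reference models; the sketch only fixes the shape of the objective, not how those quantities are computed.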