SimPO: Simple Preference Optimization with a Reference-Free Reward

Source: arXiv. Code: https://github.com/princeton-nlp/SimPO. v2 updates: additional baselines (RRHF, SLiC-HF, CPO); a new setting, Llama3-Instruct-v0.2 (Appendix G); more analyses (Section 4.4 and Appendix H).

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute- and memory-efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models like Mistral and Llama3. We evaluate on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a 53.7 length-controlled win rate on AlpacaEval 2, surpassing Claude 3 Opus on the leaderboard, and a 36.5 win rate on Arena-Hard, making it the strongest 8B open-source model.
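To make the two design choices in the abstract concrete (a length-normalized, reference-free reward plus a target margin inside the Bradley-Terry objective), here is a minimal sketch in plain Python. The abstract does not spell out the formula, so the names `beta` (reward scale) and `gamma` (target margin), their default values, and the helper names `avg_logprob` and `simpo_loss` are assumptions made for illustration. The inputs are assumed to be per-token log probabilities that the policy model assigns to each response.

```python
import math


def avg_logprob(token_logprobs):
    """Mean per-token log probability of one response (length-normalized)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=1.0):
    """Sketch of a SimPO-style loss for a single preference pair.

    beta and gamma are illustrative defaults, not values from the source.
    No reference model appears anywhere: the reward depends only on the
    policy's own log probabilities.
    """
    # Implicit rewards: scaled average log probability of each response.
    reward_chosen = beta * avg_logprob(chosen_logprobs)
    reward_rejected = beta * avg_logprob(rejected_logprobs)

    # Bradley-Terry logit with the target reward margin subtracted.
    z = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical per-token log probabilities for a winning and a losing response.
    chosen = [-0.5, -0.5]
    rejected = [-1.0, -1.0, -1.0]
    print(simpo_loss(chosen, rejected))
```

Because the reward is an average rather than a sum, repeating the same tokens in a response leaves its reward unchanged, which is one way to see why this formulation does not reward length by itself. Raising `gamma` increases the loss for a fixed pair, so the model has to separate the winning and losing responses by a larger margin to reach the same loss.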
