Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further enhancing the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama3. We evaluate on extensive instruction-following benchmarks, including AlpacaEval 2, MT-Bench, and the recent challenging Arena-Hard benchmark. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Llama3-8B-Instruct, achieves a remarkable 44.7 length-controlled win rate on AlpacaEval 2 -- surpassing Claude 3 Opus on the leaderboard -- and a 33.8 win rate on Arena-Hard, making it the strongest 8B open-source model.
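The two key ideas in the abstract -- a length-normalized implicit reward and a target reward margin added to the Bradley-Terry objective -- can be sketched for a single preference pair as follows. This is a minimal illustration, not the paper's implementation; the `beta` and `gamma` values are placeholder assumptions, not tuned hyperparameters.

```python
import math

def simpo_loss(avg_logp_win, avg_logp_lose, beta=2.0, gamma=0.5):
    """Sketch of the SimPO loss for one preference pair.

    avg_logp_win / avg_logp_lose are the length-normalized sequence
    log-probabilities log pi(y|x) / |y| of the winning and losing
    responses; these serve as the implicit rewards, so no reference
    model is needed. `gamma` is the target reward margin added to the
    Bradley-Terry objective; `beta` scales the reward difference.
    """
    # Reward difference minus the target margin: the loss is only small
    # once the winner beats the loser by more than gamma / beta.
    margin = beta * (avg_logp_win - avg_logp_lose) - gamma
    # Bradley-Terry negative log-likelihood, -log sigmoid(margin),
    # written in the numerically stable softplus form log(1 + e^{-x}).
    return math.log1p(math.exp(-margin)) if margin > -30.0 else -margin
```

As expected, widening the gap between the winning and losing responses' average log probabilities drives the loss toward zero, while a gap smaller than the target margin is still penalized.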