Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches streamline the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee preservation of the chosen response's absolute likelihood. This can lead to unlearning, where the model degrades the probability of high-quality outputs to satisfy margin constraints, and to formatting collapse caused by over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term that maximizes the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
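The three components described above can be sketched as a single loss function. This is a minimal illustration, not the paper's exact formulation: the function name `slime_loss`, all hyperparameters (`hard_margin`, `beta`, the weights, and the log-probability floor), and the specific choice of a hinge for the hard margin plus a log-sigmoid for the soft margin are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def slime_loss(logp_chosen, logp_rejected,
               hard_margin=1.0, beta=0.1,
               anchor_weight=1.0, stab_weight=0.1, stab_floor=-20.0):
    """Hypothetical sketch of a SLIME-style reference-free objective.

    logp_chosen / logp_rejected: per-example sequence log-probabilities
    of the preferred and rejected responses under the policy model.
    """
    # (1) Anchoring term: directly maximize the likelihood of the
    # chosen responses (negative log-likelihood, minimized).
    anchor = -anchor_weight * logp_chosen.mean()

    # (2) Stabilizing penalty: keep rejected log-probs above a floor,
    # preventing them from collapsing toward -inf (probability zero).
    stabilize = stab_weight * F.relu(stab_floor - logp_rejected).mean()

    # (3) Dual margin on the chosen-rejected gap: a hard hinge enforces
    # a minimum separation; a soft log-sigmoid term keeps pushing the
    # gap wider with diminishing returns.
    gap = logp_chosen - logp_rejected
    hard = F.relu(hard_margin - gap).mean()
    soft = -F.logsigmoid(beta * gap).mean()

    return anchor + stabilize + hard + soft
```

In this sketch the stabilizing floor bounds how far the rejected likelihood can be driven down, which is one way to realize the decoupling of margin enforcement from generation quality that the objective targets.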