Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Recent approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response's absolute likelihood. This can lead to ``unlearning'', where the model degrades the probability of high-quality outputs to satisfy margin constraints, and to ``formatting collapse'' caused by over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term that maximizes the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.
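To make the three-pronged objective concrete, the following is a minimal PyTorch-style sketch of a SLIME-like loss. The abstract does not specify the exact functional forms, so every choice below is an assumption: the anchoring term is taken as a negative log-likelihood on the chosen response, the stabilizing penalty as a hinge on a likelihood floor for the rejected response, and the dual margin as a hard hinge combined with a soft logistic term; all hyperparameter names (`hard_margin`, `soft_beta`, `floor`, the weights) are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def slime_loss(logp_chosen, logp_rejected,
               hard_margin=1.0, soft_beta=2.0,
               anchor_weight=1.0, stab_weight=0.1, floor=-10.0):
    """Illustrative sketch of a SLIME-style objective (not the paper's exact form).

    logp_chosen / logp_rejected: summed (or length-normalized) log-likelihoods
    of the chosen and rejected responses under the policy, shape (batch,).
    """
    margin = logp_chosen - logp_rejected

    # (1) Anchoring term: keep the absolute likelihood of the chosen
    # response high, guarding against "unlearning" (assumed form).
    anchor = -logp_chosen

    # (2) Stabilizing penalty: discourage the rejected response's
    # log-likelihood from dropping below an assumed floor, guarding
    # against "formatting collapse" from over-penalizing rejected tokens.
    stabilize = F.relu(floor - logp_rejected)

    # (3) Dual-margin mechanism: a hard hinge constraint enforcing a
    # minimum margin plus a soft logistic term shaping the boundary.
    hard = F.relu(hard_margin - margin)
    soft = F.softplus(-soft_beta * margin)

    loss = anchor_weight * anchor + stab_weight * stabilize + hard + soft
    return loss.mean()

# Example usage with dummy per-response log-likelihoods.
logp_w = torch.tensor([-12.0, -8.5])   # chosen responses
logp_l = torch.tensor([-14.0, -9.0])   # rejected responses
print(slime_loss(logp_w, logp_l))
```

Because the margin terms act only on the difference `logp_chosen - logp_rejected`, the anchoring and stabilizing terms are what tie the objective back to absolute likelihoods, which is the decoupling of preference learning from generation quality that the abstract describes.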