Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online sampling. Despite these benefits, DPO and its variants remain sensitive to hyper-parameters and prone to instability, particularly on mathematical datasets. We argue that these issues arise from the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss function. To address this, we propose a novel LLM alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization. Our proposed BNF loss eliminates the need for pairwise contrastive losses and does not require any extra tunable hyper-parameters or pairwise preference data, streamlining the alignment pipeline to be as simple as supervised fine-tuning. We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks. The experimental results show that BNF achieves performance comparable to the best methods on the QA benchmarks, while its performance drop on the four reasoning benchmarks is significantly smaller than that of the best methods, striking a better balance between value alignment and reasoning ability. In addition, we further validate the performance of BNF on non-pairwise datasets, and conduct an in-depth analysis of the log-likelihood and logit shifts across different preference optimization methods.
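The unidirectional-feedback claim can be made concrete with the standard softmax log-likelihood gradient; the following derivation is a minimal illustration of the argument, not the BNF loss formulation itself. For logits $z$ with $p = \mathrm{softmax}(z)$ and target token $y$, maximizing the log-likelihood gives
\[
\frac{\partial \left( -\log p_y \right)}{\partial z_y} = p_y - 1,
\]
whose magnitude $1 - p_y$ vanishes as $p_y \to 1$: the update self-attenuates, acting as a negative feedback. Pushing down the likelihood of a dispreferred response, as pairwise contrastive objectives do, instead descends on $+\log p_y$, where
\[
\frac{\partial \left( \log p_y \right)}{\partial z_y} = 1 - p_y \;\longrightarrow\; 1 \quad \text{as } p_y \to 0,
\]
so the gradient does not attenuate in the downward direction. The feedback thus stabilizes only one side of the optimization, which is consistent with the instability attributed above to log-likelihood-based preference losses.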