Direct Preference Optimisation (DPO) significantly improves the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, we first show theoretically that the standard DPO loss can reduce the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else being equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
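The failure mode described in the abstract can be illustrated numerically. The sketch below evaluates the standard DPO loss on a single preference pair, using made-up sequence log-probabilities, and shows that the loss falls even though the log-probability of the preferred completion falls too. The abstract does not state the DPOP formula, so `penalised_loss` is only an assumed illustration of one way to penalise a drop in preferred likelihood; the function names, the weight `lam`, and all numbers are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair.

    Arguments are sequence log-probabilities of the preferred (w) and
    dispreferred (l) completions under the policy and the reference model.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)


def penalised_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """ASSUMPTION: an illustrative penalised variant, not taken from the abstract.

    It subtracts lam * max(0, ref_w - policy_w) from the margin, so the loss
    grows whenever the policy assigns the preferred completion a lower
    log-probability than the reference model does.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    penalty = lam * max(0.0, ref_w - policy_w)
    return -log_sigmoid(beta * (margin - penalty))


# Hypothetical log-probabilities under the frozen reference model.
REF_W, REF_L = -10.0, -10.0

# Policy at initialisation (equal to the reference) and after an update in
# which BOTH completions became less likely, the dispreferred one more so.
before = {"policy_w": -10.0, "policy_l": -10.0}
after = {"policy_w": -12.0, "policy_l": -20.0}

dpo_before = dpo_loss(ref_w=REF_W, ref_l=REF_L, **before)
dpo_after = dpo_loss(ref_w=REF_W, ref_l=REF_L, **after)
pen_before = penalised_loss(ref_w=REF_W, ref_l=REF_L, **before)
pen_after = penalised_loss(ref_w=REF_W, ref_l=REF_L, **after)

if __name__ == "__main__":
    print(f"DPO loss:       {dpo_before:.4f} -> {dpo_after:.4f}")
    print(f"Penalised loss: {pen_before:.4f} -> {pen_after:.4f}")
```

In this toy update the preferred log-probability drops from -10 to -12, yet the DPO loss decreases because the margin between the two completions widens. The assumed penalty term reverses that: the same update now raises the loss, so an optimiser would not be rewarded for it.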