Direct Preference Optimization (DPO) has become a popular method for fine-tuning large language models (LLMs) due to its stability and simplicity. However, it is also known to be sensitive to noise in the data and prone to overfitting. Recent work has proposed using distributionally robust optimization (DRO) to address potential noise and distributional shift in the data, but these methods often suffer from excessive conservatism and high computational cost. We propose DPO-PRO (DPO with Preference Robustness), a robust fine-tuning algorithm based on DPO that accounts for uncertainty in the preference distribution through a lightweight DRO formulation. Unlike prior DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences, avoiding unnecessary conservatism while incurring negligible computational overhead. We further show that DPO-PRO is equivalent to a regularized DPO objective that penalizes model overconfidence under weak preference signals. We evaluate DPO-PRO on standard alignment benchmarks and a real-world public health task. Experimental results show that our method consistently improves robustness to noisy preference signals compared to existing DPO variants.
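To make the claimed equivalence concrete, the following is a minimal sketch of how such a lightweight preference-level DRO objective can be written, assuming a one-dimensional ambiguity set over the Bernoulli preference label; the margin notation $h_\theta$, the radius $\rho$, and the set $\mathcal{U}_\rho$ are illustrative assumptions, not the paper's exact formulation. Writing the implicit reward margin of DPO as

\[
h_\theta(x, y_w, y_l) \;=\; \beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right),
\]

a preference-robust loss takes the worst case over the label probability $p$ near the observed label $\hat{p} = 1$:

\[
\mathcal{L}_{\text{DPO-PRO}}(\theta) \;=\; \mathbb{E}_{(x, y_w, y_l)}\left[\;\max_{p \,\in\, \mathcal{U}_\rho(\hat{p})} \; -\,p \log \sigma\big(h_\theta\big) \;-\; (1 - p)\log \sigma\big(-h_\theta\big)\right].
\]

Because the inner objective is linear in $p$ (its derivative in $p$ is $-h_\theta$), the maximum is attained at an endpoint of the set; with $\mathcal{U}_\rho(\hat{p}) = [1 - \rho,\, 1]$ and $h_\theta > 0$, it reduces to $-(1-\rho)\log\sigma(h_\theta) - \rho\log\sigma(-h_\theta)$, which equals the standard DPO loss plus the term $\rho\, h_\theta$. Under these assumptions, the robust objective is exactly a DPO loss with a $\rho$-weighted penalty on large margins, one way the regularized, overconfidence-penalizing form described above can arise.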