This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO's resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient $\beta$ playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.
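To make the two objectives concrete, here is a minimal sketch, assuming the standard DPO per-pair loss $-\log\sigma\big(\beta\,[(\log\pi_\theta(y_w|x)-\log\pi_{\mathrm{ref}}(y_w|x))-(\log\pi_\theta(y_l|x)-\log\pi_{\mathrm{ref}}(y_l|x))]\big)$ and a Dr. DPO-style aggregation $-\beta'\log\mathbb{E}\big[\exp(-\ell_{\mathrm{DPO}}/\beta')\big]$ over pairs, which softly down-weights high-loss (likely mislabeled) pairs; function names and toy log-probabilities are illustrative, not from the paper's codebase:

```python
import math

def dpo_loss(pi_chosen_lp, pi_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin)."""
    margin = (pi_chosen_lp - ref_chosen_lp) - (pi_rejected_lp - ref_rejected_lp)
    # -log sigmoid(x) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

def dr_dpo_loss(pair_losses, beta_prime=1.0):
    """Dr. DPO-style aggregation: -beta' * log( mean_i exp(-loss_i / beta') ).
    Small beta' trusts low-loss pairs more (robust to pairwise noise);
    large beta' recovers the plain average of the DPO losses."""
    m = sum(math.exp(-l / beta_prime) for l in pair_losses) / len(pair_losses)
    return -beta_prime * math.log(m)

# Toy batch: two clean-looking pairs and one high-loss (possibly flipped) pair.
losses = [
    dpo_loss(-1.0, -2.0, -1.5, -1.8, beta=1.0),  # policy prefers chosen
    dpo_loss(-0.8, -2.5, -1.0, -2.0, beta=1.0),
    dpo_loss(-3.0, -0.5, -1.5, -1.5, beta=1.0),  # policy strongly prefers rejected
]
print(dr_dpo_loss(losses, beta_prime=1.0))      # below the plain mean of `losses`
```

By Jensen's inequality the aggregated loss never exceeds the plain mean, and as $\beta' \to \infty$ it converges to the mean, matching the abstract's description of $\beta'$ as trading off exploration against exploitation of pair reliability.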