Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work has mainly extended DPO by modifying its objective function, we instead improve DPO from the largely overlooked yet critical perspective of data selection. Specifically, to address the parameter shrinkage caused by noisy data, we propose a novel margin-maximization principle for curating the DPO training dataset. To further mitigate the noise inherent in individual reward models, we introduce a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments across diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, using only 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements on the AlpacaEval2 benchmark across various Llama, Mistral, and Qwen models. Furthermore, our approach extends seamlessly to iterative DPO, yielding roughly a 3\% improvement with only 25\% of the online data, which reveals substantial redundancy in this presumably high-quality data construction process. These results highlight the potential of data selection strategies for advancing preference optimization.
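As a minimal illustrative sketch (our own notation, not necessarily the exact formulation used in the paper), suppose each margin source $k$ assigns a margin $m_k$ to a preference pair $(y_w, y_l)$ for prompt $x$, and each is treated as independent Bradley--Terry evidence that $y_w$ is preferred. Aggregating these likelihoods in a Bayesian (naive-Bayes) manner under a uniform prior then collapses the sources into a single preference probability:
\begin{equation*}
p(y_w \succ y_l \mid x)
= \frac{\prod_k \sigma(m_k)}{\prod_k \sigma(m_k) + \prod_k \sigma(-m_k)}
= \sigma\!\Big(\textstyle\sum_k m_k\Big),
\end{equation*}
where $\sigma$ is the logistic function and $m_k$ ranges over, e.g., an external reward-model margin and the implicit DPO reward margin. Under a margin-maximization selection rule, one would keep the training pairs with the largest aggregated preference probability (e.g., the top 10\% of Ultrafeedback).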