Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO makes more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses a leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, further confirming its robustness.
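The abstract describes Uni-DPO's mechanism only at a high level: each preference pair is reweighted by combining a static data-quality signal with a dynamic signal reflecting the model's current performance. The sketch below is a minimal, hypothetical PyTorch instantiation of such a jointly weighted DPO loss, not the paper's actual formulation; the function name, the `quality_scores` input, and the sigmoid-based difficulty term are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      quality_scores, beta=0.1, gamma=1.0):
    """Illustrative weighted DPO loss: each pair is scaled by a weight that
    combines (a) a static quality score and (b) a dynamic term derived from
    the model's current implicit reward margin. This is an assumed
    composition, not Uni-DPO's published objective."""
    # Standard DPO implicit rewards relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen_rewards - rejected_rewards

    # (b) Dynamic difficulty weight: pairs the model has not yet separated
    # (small or negative margin) are up-weighted; detached so the weight
    # itself receives no gradient.
    difficulty = torch.sigmoid(-gamma * margin).detach()

    # Combine static quality and dynamic difficulty, then normalize so the
    # average weight stays near 1 and the effective learning rate is stable.
    weights = quality_scores * difficulty
    weights = weights / (weights.mean() + 1e-8)

    # Weighted DPO objective averaged over the batch.
    losses = -F.logsigmoid(margin) * weights
    return losses.mean()
```

Under this reading, pairs that are both high quality and currently hard for the model dominate the gradient, while low-quality or already-learned pairs contribute little, which is one way to realize the adaptive data utilization the abstract claims.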