Direct Preference Optimization (DPO) has proven effective at aligning large language models with human preferences, but it is typically constrained to pairwise comparisons, overlooking the additional positive and negative responses that are commonly available in real-world settings. We propose Simultaneous Weighted Preference Optimization (SWEPO), which incorporates multiple responses per query and prioritizes those that deviate most from the average reward. This deviation-based weighting focuses training on the most informative outliers, akin to a built-in curriculum. Theoretically, we prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$. Empirically, SWEPO outperforms state-of-the-art baselines on the UltraFeedback dataset, improving over DPO and InfoNCA by up to $\sim 4\%$ in length-controlled win rate on AlpacaEval.
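To make the deviation-based weighting concrete, the sketch below is a minimal PyTorch illustration: given $k$ scalar rewards for the responses to one query, each response is weighted by a softmax over its absolute deviation from the mean reward, and the policy's log-likelihood is pushed up for above-average responses and down for below-average ones. The function names (`deviation_weights`, `multi_preference_loss`), the softmax temperature `tau`, and the signed log-likelihood objective are illustrative assumptions, not the paper's exact loss.

```python
import torch

def deviation_weights(rewards: torch.Tensor, tau: float = 1.0):
    """Weight each response by how far its reward deviates from the
    per-query mean; responses farthest from the average get the most weight.

    rewards: (k,) tensor of scalar rewards for k responses to one query.
    """
    deviations = rewards - rewards.mean()                    # signed deviation from the mean
    weights = torch.softmax(deviations.abs() / tau, dim=0)   # emphasize outliers
    return weights, deviations

def multi_preference_loss(logps: torch.Tensor, rewards: torch.Tensor,
                          beta: float = 0.1, tau: float = 1.0) -> torch.Tensor:
    """Toy multi-response objective (illustrative, not the paper's loss):
    raise the policy log-likelihood of above-average responses and lower it
    for below-average ones, scaled by the deviation-based weights.

    logps: (k,) policy log-probabilities of the k responses.
    """
    weights, deviations = deviation_weights(rewards, tau)
    signs = torch.sign(deviations)          # +1 above the mean, -1 below
    return -(weights * signs * beta * logps).sum()

if __name__ == "__main__":
    rewards = torch.tensor([0.9, 0.4, 0.35, 0.1])   # k = 4 responses to one query
    logps = torch.randn(4, requires_grad=True)       # stand-in policy log-probs
    loss = multi_preference_loss(logps, rewards)
    loss.backward()
    print(f"loss={loss.item():.4f}, grads={logps.grad}")
```

Note that when all rewards are equal, the deviations (and hence the gradients) vanish, consistent with the intuition that responses far from the average reward carry the most training signal.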