We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO) that accommodates multiple dynamically chosen positive and negative responses for each query. SWEPO employs a weighted group contrastive loss, assigning each response a weight based on its deviation from the mean reward score. This approach prioritizes responses that are significantly better or worse than average, sharpening the optimization signal. Our theoretical analysis demonstrates that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment. Additionally, we provide insights into the training dynamics of our loss function and of a related objective, InfoNCA. Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations on the AlpacaEval benchmark.
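The weighting idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual loss: it assumes weights are obtained by a softmax over absolute deviations from the mean reward, with a hypothetical temperature parameter `beta`; the true SWEPO formulation may differ.

```python
import math

def deviation_weights(rewards, beta=1.0):
    """Hypothetical sketch: responses whose reward deviates more from the
    group mean receive larger weights (softmax over absolute deviations).
    `beta` is an assumed temperature controlling how sharply deviation
    is emphasized."""
    mu = sum(rewards) / len(rewards)                      # mean reward of the group
    scores = [math.exp(beta * abs(r - mu)) for r in rewards]
    total = sum(scores)
    return [s / total for s in scores]                    # normalized weights

# Responses far above or below the mean get the largest weights.
weights = deviation_weights([0.9, 0.5, 0.1])
```

Under this sketch, the first and third responses (deviation 0.4 from the mean of 0.5) receive equal, larger weights than the middle response, so the extremes dominate the group contrastive term.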