Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, annotators may prefer the less desirable response -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment method satisfies this property. To address this, we propose H\"older-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of the clean data, providing a theoretically grounded metric for dataset valuation that identifies both the location and the fraction of mislabels. This metric is gradient-free, enabling scalable and automated valuation of human feedback without costly manual verification or a clean validation dataset. H\"older-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to the Anthropic HH-RLHF dataset, it reveals substantial label noise, and removing the detected mislabels significantly improves alignment performance across methods. The code is available at https://github.com/ma921/HolderDPO.
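The dataset-valuation idea summarized above -- ranking preference pairs by how likely the aligned model finds their labels, with no gradient computation -- can be illustrated with a minimal sketch. The code below is not the paper's H\"older-DPO metric; it only shows the standard DPO implicit-reward margin as one gradient-free way to flag candidate mislabels. The `policy`, `reference`, and `tokenizer` objects are assumed HuggingFace-style causal LM and tokenizer handles, and the helper names (`sequence_logprob`, `mislabel_scores`) are hypothetical.

```python
# Illustrative sketch only: gradient-free mislabel scoring via DPO implicit-reward
# margins, NOT the paper's H\"older-DPO valuation metric.
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokenizer, prompt, response, device="cpu"):
    """Sum of token log-probabilities of `response` given `prompt` under `model`.

    Simplification: tokenizing `prompt + response` jointly may split tokens
    differently than tokenizing `prompt` alone, so the prompt-length cut is approximate.
    """
    enc = tokenizer(prompt + response, return_tensors="pt").to(device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)
    # Logits at position t predict token t+1, hence the shift below.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = enc.input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens and sum their log-probabilities.
    return token_lp[:, prompt_len - 1:].sum().item()

def mislabel_scores(policy, reference, tokenizer, pairs, beta=0.1, device="cpu"):
    """Score each (prompt, chosen, rejected) pair by its implicit DPO reward margin.

    Low (especially negative) margins mark pairs whose preference label the
    aligned policy finds implausible relative to the reference model,
    i.e. candidate mislabels to inspect or remove.
    """
    scores = []
    for prompt, chosen, rejected in pairs:
        margin = beta * (
            (sequence_logprob(policy, tokenizer, prompt, chosen, device)
             - sequence_logprob(reference, tokenizer, prompt, chosen, device))
            - (sequence_logprob(policy, tokenizer, prompt, rejected, device)
               - sequence_logprob(reference, tokenizer, prompt, rejected, device))
        )
        scores.append(margin)
    return scores
```

Under these assumptions, sorting a preference dataset by `mislabel_scores` and reviewing or dropping the lowest-margin pairs is one scalable way to act on a likelihood-based valuation signal of the kind the abstract describes.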