This paper addresses the challenge of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets, enhancing the resilience of LLMs against these issues. In particular, we devise a ranking algorithm with a guaranteed polynomial running time that robustifies several existing models, such as the classic Bradley--Terry--Luce (BTL) model (Bradley and Terry, 1952) and certain generalizations of it. To the best of our knowledge, this work is the first to propose an algorithm that provably recovers an ε-optimal ranking with high probability while tolerating as many as O(n) perturbed pairwise comparison results per model response. Furthermore, we establish robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well, both in general settings and on LLM preference datasets. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline to handle missing and maliciously manipulated inputs.
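To make the problem setting concrete, the following is a minimal sketch (not the paper's algorithm) of the BTL model and the kind of corruption the abstract describes: each item carries a latent score, item i beats item j with probability σ(θᵢ − θⱼ), and an adversary may flip up to O(n) comparisons involving a single item. All names below (theta, wins, naive_rank) are illustrative assumptions.

```python
import math
import random

random.seed(0)

# Hypothetical BTL setup: n items, each with a latent score theta[i].
n = 5
theta = [random.gauss(0, 1) for _ in range(n)]

def btl_prob(i, j):
    """P(i beats j) under the Bradley--Terry--Luce model: sigma(theta_i - theta_j)."""
    return 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))

# Sample a pairwise-comparison matrix from the BTL model.
wins = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j and random.random() < btl_prob(i, j):
            wins[i][j] = 1  # item i beat item j in this sample

# Adversary reverses every comparison involving item 0 -- this is
# O(n) corrupted comparisons for that one item, the regime the
# abstract's recovery guarantee is stated for.
corrupted = [row[:] for row in wins]
for j in range(n):
    corrupted[0][j], corrupted[j][0] = wins[j][0], wins[0][j]

# A naive ranking by raw win counts can be swayed by such corruption,
# which motivates the robust, provably correct recovery the paper targets.
naive_rank = sorted(range(n), key=lambda i: -sum(corrupted[i]))
```

This sketch only shows why naive win-count ranking is fragile; the paper's contribution is an algorithm that recovers an ε-optimal ranking despite such perturbations, including when some comparisons are unobserved.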