Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts to address this problem either only marginally alleviate the impact of noise without actually reducing its presence, or rely on costly teacher LLMs that are prone to reward misgeneralization. To address these challenges, we propose the RObust Preference Optimization (ROPO) framework, an iterative alignment approach that integrates noise tolerance with the filtering of noisy samples, without the aid of external models. Specifically, ROPO iteratively solves a constrained optimization problem in which we dynamically assign a quality-aware weight to each sample and constrain the sum of the weights to equal the number of samples we intend to retain. For noise-tolerant training and effective noise identification, we derive a robust loss by suppressing the gradients of samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is critical for distinguishing noisy samples from clean ones. Furthermore, inspired by our derived loss, we propose a robustness-guided rejection sampling technique to compensate for the potentially important information in discarded queries. Experiments on three widely used datasets with Mistral-7B and Llama-2-7B demonstrate that ROPO significantly outperforms existing preference alignment methods, with its superiority growing as the noise rate increases.
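As a minimal sketch of the weight-assignment step described above, consider the model parameters fixed and suppose each sample has a scalar loss that serves as its quality signal (an assumption; the abstract does not specify the signal, and the robust loss itself is not reproduced here). Minimizing the weighted loss subject to weights in [0, 1] that sum to the number of retained samples is then a linear program, and one optimal solution gives weight 1 to the lowest-loss samples and 0 to the rest. The function name `assign_weights` and the example losses are hypothetical.

```python
def assign_weights(losses, keep):
    """Solve min_w sum_i w_i * losses[i]  s.t.  0 <= w_i <= 1, sum_i w_i = keep.

    With the losses held fixed, the objective is linear in w, so an optimal
    solution puts weight 1 on the `keep` smallest losses and 0 elsewhere.
    """
    if not 0 <= keep <= len(losses):
        raise ValueError("keep must be between 0 and len(losses)")
    # Indices in ascending order of loss; sorted() is stable, so ties keep
    # their original order.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    weights = [0.0] * len(losses)
    for i in order[:keep]:
        weights[i] = 1.0
    return weights


# Hypothetical per-sample losses: the two large values play the role of
# suspected noisy preference pairs.
losses = [0.21, 1.87, 0.35, 0.09, 2.40]
weights = assign_weights(losses, keep=3)
print(weights)  # [1.0, 0.0, 1.0, 1.0, 0.0]
```

In an iterative scheme, a step like this would alternate with model updates on the retained samples. How informative the losses are for separating noisy from clean samples depends on the training loss used, which is the role the abstract attributes to the derived robust loss.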