Provably Robust DPO: Aligning Language Models with Noisy Feedback

Learning from preference-based feedback has recently gained traction as a promising approach to align language models with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might prevent the language models from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of how they work remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular, since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function that de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the flip rate of labels, $d$ is the policy parameter dimension, and $n$ is the size of the dataset. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is more robust to noise in preference labels than vanilla DPO and other heuristics proposed by practitioners.
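
To make the phrase "de-biases the effect of noise on average" concrete, the following is a minimal sketch of one standard construction of an unbiased loss under symmetric label flips. The abstract does not spell out the loss, so the exact form used here is an assumption, as are the names `dpo_loss`, `debiased_loss`, `expected_noisy_loss`, and the scalar `margin` (standing in for the scaled log-ratio difference between the preferred and rejected responses). The sketch checks numerically that, when a label is flipped with probability $\epsilon$, the expectation of the corrected loss equals the clean logistic loss.

```python
import math


def dpo_loss(margin: float) -> float:
    """Logistic preference loss -log(sigmoid(margin)), computed stably."""
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def debiased_loss(margin: float, eps: float) -> float:
    """Noise-corrected loss for an observed (possibly flipped) preference.

    Assumed form: ((1 - eps) * L(margin) - eps * L(-margin)) / (1 - 2 * eps),
    where eps is the flip rate and must satisfy 0 <= eps < 1/2.
    """
    if not 0.0 <= eps < 0.5:
        raise ValueError("flip rate eps must lie in [0, 0.5)")
    return ((1.0 - eps) * dpo_loss(margin) - eps * dpo_loss(-margin)) / (1.0 - 2.0 * eps)


def expected_noisy_loss(clean_margin: float, eps: float) -> float:
    """Expectation of the corrected loss over a random label flip.

    With probability 1 - eps the observed pair is in the true order
    (margin = clean_margin); with probability eps it is swapped
    (margin = -clean_margin).
    """
    return ((1.0 - eps) * debiased_loss(clean_margin, eps)
            + eps * debiased_loss(-clean_margin, eps))


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.3, 0.45):
        gap = abs(expected_noisy_loss(0.7, eps) - dpo_loss(0.7))
        # The gap is zero up to floating-point error for every eps < 1/2.
        print(f"eps={eps:.2f}  |E[corrected] - clean| = {gap:.2e}")
```

The division by $1-2\epsilon$ in this sketch mirrors the $\frac{1}{1-2\epsilon}$ factor in the stated bound: as the flip rate approaches $1/2$ the correction blows up, which matches the intuition that labels flipped half the time carry no information about the true preference.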
