Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, such as personal bias, context ambiguity, or lack of training, human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach -- $R^3M$, which models potentially corrupted preference labels as sparse outliers. Accordingly, we formulate robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which incurs only negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves the robustness of the learned reward against several types of perturbations to the preference data.
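To make the formulation above concrete, here is a minimal sketch of what the $\ell_1$-regularized maximum likelihood objective could look like, assuming a Bradley--Terry preference model; the reward model $r_\theta$, the per-sample outlier variables $\delta_i$, and the penalty weight $\lambda$ are illustrative notation rather than definitions taken from the paper:
\[
\min_{\theta,\,\delta}\; -\frac{1}{n}\sum_{i=1}^{n} \log \sigma\bigl(r_\theta(x_i, y_i^{w}) - r_\theta(x_i, y_i^{l}) + \delta_i\bigr) \;+\; \lambda \lVert \delta \rVert_1,
\]
where $(y_i^{w}, y_i^{l})$ denote the preferred and rejected responses for prompt $x_i$, $\sigma$ is the sigmoid function, and $\delta_i$ absorbs a potentially corrupted label. Under such a formulation, alternating optimization is natural: for fixed $\theta$, each $\delta_i$ can be updated by an inexpensive one-dimensional proximal (soft-thresholding-style) step, which is consistent with the negligible computational overhead noted above.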