Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecification. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
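As context for the Bradley-Terry model mentioned above: under that model, the probability that one response is preferred over another is a logistic function of their reward difference, and reward learning minimizes the resulting negative log-likelihood over preference pairs. The sketch below is a minimal illustration of that standard formulation; the function names are illustrative and not taken from the paper's codebase.

```python
import math

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred:
    sigma(r_chosen - r_rejected), where sigma is the logistic function."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one observed preference pair under the
    Bradley-Terry model; smaller when the reward gap is larger."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))
```

Equal rewards give probability 0.5, and the loss shrinks as the reward model separates the preferred response from the rejected one; misspecification arises when real human judgments do not follow this logistic form.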