Preference-based feedback is important for many applications of reinforcement learning in which direct evaluation of a reward function is not feasible. A notable recent example is reinforcement learning from human feedback (RLHF) on large language models. For many applications of RLHF, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose the contexts at which to obtain human feedback in order to identify a good policy as efficiently as possible, and we formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. Finally, we extend the setting and methodology for practical use in RLHF training of large language models. Here, our method reaches better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.
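To make the idea of choosing contexts concrete, the following is a minimal illustrative sketch and not the algorithm analyzed in this work. It assumes a small tabular setting with hypothetical arrays `wins[c][a]` and `plays[c][a]` (duels won and duels played by action `a` in context `c`), a generic Hoeffding-style confidence width, and a selection rule that queries the context where the optimistic and pessimistic estimates of the best action disagree most. All names and the exact form of the bonus are assumptions made for illustration.

```python
import math
import random


def confidence_bounds(wins, plays, t):
    """Return (lower, upper) bounds on an action's win rate, clipped to [0, 1]."""
    if plays == 0:
        return 0.0, 1.0  # nothing observed yet: maximally uncertain
    mean = wins / plays
    width = math.sqrt(2.0 * math.log(t + 1) / plays)  # assumed Hoeffding-style bonus
    return max(0.0, mean - width), min(1.0, mean + width)


def select_query(wins, plays, t, rng):
    """Pick (context, action, rival) for the next human preference query.

    The context is the one with the largest gap between the best upper bound
    and the best lower bound; the action is the most optimistic one there,
    and the rival is drawn uniformly from the remaining actions.
    Assumes every context has at least two actions.
    """
    best = None
    for c in range(len(plays)):
        bounds = [confidence_bounds(wins[c][a], plays[c][a], t)
                  for a in range(len(plays[c]))]
        lows = [lo for lo, _ in bounds]
        highs = [hi for _, hi in bounds]
        gap = max(highs) - max(lows)
        if best is None or gap > best[0]:
            best = (gap, c, highs.index(max(highs)))
    _, context, action = best
    rivals = [a for a in range(len(plays[context])) if a != action]
    return context, action, rng.choice(rivals)


# Toy data: context 0 is well explored, context 1 has never been queried.
wins = [[40, 10], [0, 0]]
plays = [[50, 50], [0, 0]]
query = select_query(wins, plays, t=100, rng=random.Random(0))
```

In this toy run the unexplored context is selected, since its confidence interval is widest. A practical RLHF variant would replace the tabular counts with a learned reward model and its uncertainty estimate.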