Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) learns directly from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper proves theoretically that, for a wide range of preference models, preference-based RL can be solved directly with existing algorithms and techniques for reward-based RL, at little or no extra cost. Specifically, (1) for preferences drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences, where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL that finds Nash equilibria of factored Markov games over a restricted set of policies. The latter case can be further reduced to adversarial MDPs when preferences depend only on the final state. We instantiate all reward-based RL subroutines with concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees for the setting where K-wise comparisons are available.
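To make the two preference settings in the abstract concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a Bradley-Terry (logistic) link as one example of a reward-based probabilistic preference model, and it checks the von Neumann winner condition on a small finite set of options rather than on policies. The names `bradley_terry`, `win_prob`, `is_von_neumann_winner`, and the matrix `CYCLIC` are hypothetical and introduced only for illustration.

```python
import math


def bradley_terry(r_a: float, r_b: float) -> float:
    """Probability that a is preferred to b under a logistic link on the
    reward gap (assumed here as one example of a reward-based model)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def win_prob(mix, pref, j):
    """Probability that an option sampled from the mixture `mix` is
    preferred to the fixed option j, where pref[i][j] = P(i preferred to j)."""
    return sum(p * pref[i][j] for i, p in enumerate(mix))


def is_von_neumann_winner(mix, pref, tol=1e-9):
    """A mixture is a von Neumann winner if it is preferred with probability
    at least 1/2 against every option. Checking pure options suffices,
    because the win probability is linear in the opponent's mixture."""
    return all(win_prob(mix, pref, j) >= 0.5 - tol for j in range(len(pref)))


# Toy cyclic preferences (0 beats 1, 1 beats 2, 2 beats 0): no reward
# function can explain them, so only the general-preference setting applies.
CYCLIC = [
    [0.5, 1.0, 0.0],
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.5],
]

if __name__ == "__main__":
    print(bradley_terry(2.0, 1.0))
    print(is_von_neumann_winner([1 / 3, 1 / 3, 1 / 3], CYCLIC))
    print(is_von_neumann_winner([1.0, 0.0, 0.0], CYCLIC))
```

In this toy example, no single option is preferred over every other, yet the uniform mixture satisfies the von Neumann winner condition. This is why the general-preference objective is posed as an equilibrium problem rather than as maximization of a single reward.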