Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical success while relying on only a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Moreover, most recent theoretical studies focus on value-based algorithms, despite the empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed; instead, the algorithm uses trajectory-based comparison feedback to infer the reward function. We establish performance bounds for PO-RLHF with low query complexity, which provides insight into why a small amount of human feedback may suffice for RLHF to achieve good performance. A key novelty is a trajectory-level elliptical potential analysis, which bounds the reward estimation error when comparison feedback (rather than numerical reward observations) is given. We present and analyze the algorithms PG-RLHF and NN-PG-RLHF for two settings: linear and neural function approximation, respectively.
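For concreteness, a common way to model trajectory-based comparison feedback in this line of work is a Bradley-Terry-type model; the following is a minimal sketch under that assumption, and the notation ($\phi$, $\theta^\star$, $H$, $y_i$) is illustrative rather than taken from the abstract. A trajectory $\tau = (s_1, a_1, \ldots, s_H, a_H)$ is preferred to $\tau'$ with probability
\[
  \mathbb{P}\left(\tau \succ \tau'\right) = \sigma\!\left(r(\tau) - r(\tau')\right),
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}},
\]
where, in the linear-function-approximation setting, the trajectory reward would be assumed linear in cumulative features,
\[
  r(\tau) = \sum_{h=1}^{H} \phi(s_h, a_h)^{\top} \theta^\star ,
\]
and the reward parameter could be estimated from $n$ comparison queries with labels $y_i \in \{-1, +1\}$ by maximizing the logistic log-likelihood,
\[
  \hat{\theta} \in \arg\max_{\theta}\; \sum_{i=1}^{n} \log \sigma\!\left( y_i \left( r_\theta(\tau_i) - r_\theta(\tau_i') \right) \right).
\]
In this kind of setup, a trajectory-level elliptical potential argument is what would control how the estimation error of $\hat{\theta}$ accumulates over queried trajectory pairs; the paper's exact formulation may differ.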