In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best-policy identification algorithm, called $\mathsf{BSAD}$, that requires no explicit reward model inference, a critical intermediate step in contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that repeatedly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts reward-free exploration and a best-arm-identification-like adaptive stopping criterion to equalize the visitation among all states in the same decision step, while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}}SA^3H^3M\log\frac{1}{\delta})$ that resembles the corresponding result in classic RL, where $c_{\mathcal{M}}$ is an instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show that: (i) in terms of sample complexity, RLHF is not significantly harder than classic RL; and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls of reward inference such as overfitting and distribution shift.