The study of reinforcement learning from human feedback (RLHF) has gained prominence in recent years due to its role in the development of LLMs. Neuroscience research shows that human responses to stimuli depend on partially observed "internal states." Unfortunately, current models of RLHF do not take this into consideration. Moreover, most RLHF models do not account for intermediate feedback, which is gaining importance in empirical work and can help improve both sample complexity and alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We show reductions from the two dominant forms of human feedback in RLHF, cardinal and dueling feedback, to PORRL. For cardinal feedback, we develop generic statistically efficient algorithms and instantiate them to present POR-UCRL and POR-UCBVI. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. We show that our models and guarantees in both settings generalize and extend existing ones. Finally, we identify a recursive structure in our model that could improve the statistical and computational tractability of PORRL, giving examples from past work on RLHF as well as learning perfect reward machines, which PORRL subsumes.