A Framework for Partially Observed Reward-States in RLHF

The study of reinforcement learning from human feedback (RLHF) has gained prominence in recent years due to its role in the development of LLMs. Neuroscience research shows that human responses to stimuli depend on partially observed "internal states." Unfortunately, current models of RLHF do not take this into consideration. Moreover, most RLHF models do not account for intermediate feedback, which is gaining importance in empirical work and can help improve both sample complexity and alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We show reductions from the two dominant forms of human feedback in RLHF, cardinal and dueling feedback, to PORRL. For cardinal feedback, we develop generic statistically efficient algorithms and instantiate them to present POR-UCRL and POR-UCBVI. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. We show that our models and guarantees in both settings generalize and extend existing ones. Finally, we identify a recursive structure on our model that could improve the statistical and computational tractability of PORRL, giving examples from past work on RLHF as well as learning perfect reward machines, which PORRL subsumes.

