The growing deployment of reinforcement learning from human feedback (RLHF) calls for a deeper theoretical investigation of its underlying models. The prevalent models of RLHF do not account for neuroscience-backed, partially observed "internal states" that can affect human feedback, nor do they accommodate intermediate feedback during an interaction. Both can be instrumental in speeding up learning and improving alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We accommodate two kinds of feedback: cardinal and dueling. We first demonstrate that PORRL subsumes a wide class of RL problems, including traditional RL, RLHF, and reward machines. For cardinal feedback, we present two model-based methods (POR-UCRL and POR-UCBVI). We give both cardinal regret and sample complexity guarantees for these methods, showing that they improve over naive history-summarization. We then discuss the benefits of a model-free method like GOLF with naive history-summarization in settings with recursive internal states and dense intermediate feedback. For this purpose, we define a new history-aware version of the Bellman-eluder dimension and give a new guarantee for GOLF in our setting, which can be exponentially sharper in illustrative examples. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret into guarantees for dueling regret. In both feedback settings, we show that our models and guarantees generalize and extend existing ones.