The growing deployment of reinforcement learning from human feedback (RLHF) calls for a deeper theoretical investigation of its underlying models. The prevalent models of RLHF account neither for neuroscience-backed, partially observed "internal states" that can affect human feedback, nor for intermediate feedback during an interaction. Both can be instrumental in speeding up learning and improving alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We accommodate two kinds of feedback: cardinal and dueling feedback. We first demonstrate that PORRL subsumes a wide class of RL problems, including traditional RL, RLHF, and reward machines. For cardinal feedback, we present two model-based methods (POR-UCRL and POR-UCBVI). We give both cardinal regret and sample complexity guarantees for these methods, showing that they improve over naive history-summarization. We then discuss the benefits of a model-free method like GOLF with naive history-summarization in settings with recursive internal states and dense intermediate feedback. For this purpose, we define a new history-aware version of the Bellman-eluder dimension and give a new guarantee for GOLF in our setting, which can be exponentially sharper in illustrative examples. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret into guarantees for dueling regret. In both feedback settings, we show that our models and guarantees generalize and extend existing ones.