In this paper, we study offline Reinforcement Learning with Human Feedback (RLHF), where the aim is to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for several reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model human decision-making that is forward-looking and boundedly rational. We propose a \underline{D}ynamic-\underline{C}hoice-\underline{P}essimistic-\underline{P}olicy-\underline{O}ptimization (DCPPO) method. The method has three stages: the first estimates the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second recovers the human reward function by minimizing the Bellman mean squared error using the learned value functions; the third plugs in the learned reward and invokes pessimistic value iteration to find a near-optimal policy. Assuming only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO almost matches that of the classical pessimistic offline RL algorithm in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantee for off-policy offline RLHF with a dynamic discrete choice model.
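The three stages described above can be illustrated with a minimal tabular sketch. This is not the paper's implementation; it is a toy version under several assumptions of ours: states and actions are small integer sets, the behavior policy is a softmax of Q (so its tabular MLE is the smoothed empirical choice frequency), Q_h(s, 0) = 0 is imposed as an identifying normalization, the discount `gamma` is used in both the choice model and the planning stage, and pessimism is a count-based penalty `beta / sqrt(N_h(s, a))`. The function name `dcppo_tabular`, its parameters, and the `(s, a, s_next)` trajectory format are all hypothetical.

```python
import math
from collections import defaultdict


def dcppo_tabular(trajs, n_states, n_actions, horizon,
                  gamma=1.0, beta=1.0, smooth=1e-3):
    """Toy three-stage sketch. Each trajectory is a list of (s, a, s_next)."""
    S, A, H = n_states, n_actions, horizon
    # Visit counts N_h(s, a) and the observed next states for each (h, s, a).
    count = [[[0] * A for _ in range(S)] for _ in range(H)]
    nexts = [defaultdict(list) for _ in range(H)]
    for traj in trajs:
        for h, (s, a, s_next) in enumerate(traj):
            count[h][s][a] += 1
            nexts[h][(s, a)].append(s_next)

    # Stage 1: MLE of the behavior policy (smoothed frequencies), then
    # Q_h(s, a) = log pi_h(a|s) - log pi_h(0|s)   (assumed normalization)
    # V_h(s)    = log sum_a exp Q_h(s, a) = -log pi_h(0|s).
    q_hat = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    v_hat = [[0.0] * S for _ in range(H + 1)]  # v_hat[H] stays 0 (terminal)
    for h in range(H):
        for s in range(S):
            total = sum(count[h][s]) + smooth * A
            log_pi = [math.log((count[h][s][a] + smooth) / total)
                      for a in range(A)]
            q_hat[h][s] = [lp - log_pi[0] for lp in log_pi]
            v_hat[h][s] = -log_pi[0]

    # Stage 2: the tabular minimizer of the Bellman mean squared error is
    # r_h(s, a) = Q_h(s, a) - gamma * (empirical mean of V_{h+1}(s')).
    # Unvisited (s, a) pairs keep the default reward 0.
    r_hat = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in range(H):
        for (s, a), ns in nexts[h].items():
            mean_v = sum(v_hat[h + 1][s2] for s2 in ns) / len(ns)
            r_hat[h][s][a] = q_hat[h][s][a] - gamma * mean_v

    # Stage 3: pessimistic value iteration on the learned reward, with a
    # hard max over actions and a count-based uncertainty penalty.
    v_pess = [[0.0] * S for _ in range(H + 1)]
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            q_row = []
            for a in range(A):
                ns = nexts[h].get((s, a), [])
                mean_v = (sum(v_pess[h + 1][s2] for s2 in ns) / len(ns)
                          if ns else 0.0)
                penalty = beta / math.sqrt(max(count[h][s][a], 1))
                q_row.append(r_hat[h][s][a] + gamma * mean_v - penalty)
            policy[h][s] = max(range(A), key=q_row.__getitem__)
            v_pess[h][s] = q_row[policy[h][s]]
    return policy, r_hat
```

For example, calling `dcppo_tabular(trajs, n_states=1, n_actions=2, horizon=1)` on a dataset where action 1 is chosen far more often than action 0 returns a policy that selects action 1 and a positive estimated reward for it relative to the reference action. The sketch does not clip values and treats unvisited state-action pairs only through the finite penalty, so it should be read as an illustration of the pipeline's structure rather than of the guarantees stated in the abstract.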