We study offline Reinforcement Learning with Human Feedback (RLHF), where the goal is to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for several reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model human decision-making that is forward-looking and boundedly rational. We propose a \underline{D}ynamic-\underline{C}hoice-\underline{P}essimistic-\underline{P}olicy-\underline{O}ptimization (DCPPO) method. The method has three stages: the first estimates the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second recovers the human reward function by minimizing the Bellman mean squared error using the learned value functions; the third plugs in the learned reward and invokes pessimistic value iteration to find a near-optimal policy. With only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO almost matches that of the classical pessimistic offline RL algorithm in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with a dynamic discrete choice model.
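To make the first two stages concrete, the following is a minimal tabular sketch and not the DCPPO algorithm itself. It assumes the standard logit form of the DDC model, in which the human chooses action $a$ in state $s$ with probability proportional to $\exp Q(s,a)$ and $V(s) = \log \sum_a \exp Q(s,a)$. It further assumes a discounted setting with known transitions and the normalization $Q(s, 0) = 0$. The function names, the smoothing constant, and the normalization are illustrative assumptions that do not come from the abstract, and the pessimistic value iteration of the third stage is not shown.

```python
import numpy as np

def choice_probs(q):
    """Logit choice probabilities, pi(a|s) proportional to exp(Q(s,a))."""
    z = q - q.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def soft_value(q):
    """V(s) = log sum_a exp(Q(s,a)), the value implied by the logit model."""
    m = q.max(axis=1)
    return m + np.log(np.exp(q - m[:, None]).sum(axis=1))

def mle_q(states, actions, n_states, n_actions, smoothing=1e-3):
    """Stage 1, tabular case: the MLE of pi(a|s) is the empirical frequency.

    Q is only identified up to a per-state constant, so we fix Q(s, 0) = 0
    (an assumed normalization). `smoothing` is a hypothetical pseudo-count
    that keeps the logarithm finite for unseen state-action pairs.
    """
    counts = np.full((n_states, n_actions), smoothing)
    np.add.at(counts, (states, actions), 1.0)
    log_pi = np.log(counts / counts.sum(axis=1, keepdims=True))
    return log_pi - log_pi[:, [0]]

def recover_reward(q, transition, gamma):
    """Stage 2, tabular case with known transitions.

    Solves the Bellman relation Q = r + gamma * E[V(s')] for r, where
    `transition` has shape (n_states, n_actions, n_states).
    """
    return q - gamma * transition @ soft_value(q)
```

With function approximation and unknown transitions, the exact inversion in `recover_reward` would be replaced by a regression that minimizes the Bellman mean squared error over observed transitions; the tabular form is used here only because it can be checked by hand.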