We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields a simple regret of exponential form $\exp ( - \Omega(n/H) )$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RLHF with pairwise comparisons. Interestingly, the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds with exponential convergence, our work fills a gap in existing studies, which concentrate on establishing worst-case regrets with {\em inverse polynomial convergence} (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.
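For concreteness, the following is a minimal formal sketch of the setting summarized above, assuming the standard linear Bradley--Terry comparison model used by Zhu et al. (2023); the feature map $\phi$, the comparison probability, and the learned policy $\hat{\pi}$ are illustrative notation rather than definitions taken verbatim from the paper:
\begin{align*}
  r_{\theta^*}(s,a) &= \langle \phi(s,a), \theta^* \rangle
    && \text{(implicit reward, linear in the unknown $\theta^*$)} \\
  \mathbb{P}\bigl(a^1 \succ a^0 \mid s\bigr) &= \sigma\bigl(r_{\theta^*}(s,a^1) - r_{\theta^*}(s,a^0)\bigr),
    \quad \sigma(x) = \tfrac{1}{1+e^{-x}}
    && \text{(pairwise comparison feedback)} \\
  \mathrm{SR}(\hat{\pi}) &= \mathbb{E}\bigl[\, r_{\theta^*}\bigl(s,\pi^*(s)\bigr) - r_{\theta^*}\bigl(s,\hat{\pi}(s)\bigr) \bigr]
    \le \exp\bigl(-\Omega(n/H)\bigr)
    && \text{(simple regret guarantee of {\sc RL-LOW})}
\end{align*}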