We consider the problem of offline reinforcement learning from human feedback (RLHF) with pairwise comparisons proposed by Zhu et al. (2023), where the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields a simple regret of exponential form $\exp ( - \Omega(n/H) )$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RLHF with pairwise comparisons. Interestingly, the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds with exponential convergence, our work fills a gap in existing studies, which concentrate on establishing worst-case regrets with {\em inverse polynomial convergence} (e.g., $\widetilde{O}(\frac{1}{\sqrt{n}})$) for offline RLHF with pairwise comparisons.
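For concreteness, the following is a minimal formal sketch of the setting summarized above, assuming the standard linear Bradley--Terry comparison model used by Zhu et al. (2023); the feature map $\phi$, the comparison probability, and the learned policy $\hat{\pi}$ are illustrative notation rather than definitions taken verbatim from the paper:
\begin{align*}
  r_{\theta^*}(s,a) &= \langle \phi(s,a), \theta^* \rangle
    && \text{(implicit reward, linear in the unknown $\theta^*$)} \\
  \mathbb{P}\bigl(a^1 \succ a^0 \mid s\bigr) &= \sigma\bigl(r_{\theta^*}(s,a^1) - r_{\theta^*}(s,a^0)\bigr),
    \quad \sigma(x) = \tfrac{1}{1+e^{-x}}
    && \text{(pairwise comparison feedback)} \\
  \mathrm{SR}(\hat{\pi}) &= \mathbb{E}\bigl[\, r_{\theta^*}\bigl(s,\pi^*(s)\bigr) - r_{\theta^*}\bigl(s,\hat{\pi}(s)\bigr) \bigr]
    \le \exp\bigl(-\Omega(n/H)\bigr)
    && \text{(simple regret guarantee of {\sc RL-LOW})}
\end{align*}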