We consider offline reinforcement learning (RL) with preference feedback, in which the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective is to identify the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields a simple regret of $\exp ( - \Omega(n/H) )$, where $n$ is the number of data samples and $H$ denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound for offline RL with preference feedback. Interestingly, the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of $(\varepsilon,\delta)$-differential privacy and show, somewhat surprisingly, that the hardness parameter $H$ is unchanged in the asymptotic regime as $n$ tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds, our work stands in stark contrast to previous works that focus on establishing worst-case regrets for offline RL with preference feedback.