In this paper, we study the offline and online settings of reinforcement learning from human feedback (RLHF) with KL-regularization -- a widely used objective in large language model alignment -- under the $\epsilon$ local differential privacy ($\epsilon$-LDP) model on the human preference labels. In the offline setting, we design an algorithm based on the principle of pessimism and derive a new suboptimality gap of $\tilde{O}(1/[(e^\epsilon-1)^2 n])$ on the KL-regularized objective under single-policy concentrability, where $n$ is the sample size. We also prove its optimality by providing a matching lower bound. In the online setting, we provide the first theoretical study of KL-regularized RLHF under LDP. We design an optimism-based algorithm and derive a logarithmic regret bound of $O(d_{\mathcal{F}}\log (N_{\mathcal{F}}\cdot T) /(e^\epsilon-1)^2 )$, where $T$ is the total number of time steps, $N_{\mathcal{F}}$ is the cardinality of the reward function space $\mathcal{F}$, and $d_{\mathcal{F}}$ is a variant of the eluder dimension for RLHF. As a by-product of our analysis, our results also yield the first analysis of online KL-regularized RLHF without privacy constraints. We implement our algorithm in the offline setting to verify our theoretical results and release our open-source code at: https://github.com/rushil-thareja/PPKL-RLHF-Official.
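For concreteness, below is a minimal sketch of the standard KL-regularized objective and of randomized response, the canonical mechanism for achieving $\epsilon$-LDP on a binary preference label. The symbols $\beta$, $\pi_{\mathrm{ref}}$, $r$, $\rho$, and $y$ are illustrative notation introduced here and may differ from the notation used in the body of the paper.
\[
  J(\pi) \;=\; \mathbb{E}_{x \sim \rho,\, a \sim \pi(\cdot \mid x)}\bigl[r(x,a)\bigr]
  \;-\; \beta\, \mathbb{E}_{x \sim \rho}\Bigl[\mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)\Bigr],
  \qquad \beta > 0,
\]
\[
  \Pr[\tilde{y} = y] \;=\; \frac{e^{\epsilon}}{1 + e^{\epsilon}}, \qquad
  \Pr[\tilde{y} = 1 - y] \;=\; \frac{1}{1 + e^{\epsilon}},
\]
where $\tilde{y}$ is the privatized report of the true label $y \in \{0,1\}$. Debiasing such a report inflates the estimation error by a factor of order $(e^{\epsilon}+1)/(e^{\epsilon}-1)$, which is consistent with the $1/(e^{\epsilon}-1)^2$ dependence in the bounds stated above.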