Incorporating expert demonstrations has been shown empirically to improve the sample efficiency of reinforcement learning (RL). This paper quantifies theoretically how much this extra information reduces RL's sample complexity. In particular, we study demonstration-regularized reinforcement learning, which leverages expert demonstrations through KL-regularization toward a policy learned by behavior cloning. Our findings reveal that using $N^{\mathrm{E}}$ expert demonstrations enables the identification of an optimal policy at a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{\mathcal{O}}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes (MDPs), where $\varepsilon$ is the target precision, $H$ the horizon, $A$ the number of actions, $S$ the number of states in the finite case, and $d$ the dimension of the feature space in the linear case. As a by-product, we provide tight convergence guarantees for the behavior cloning procedure under general assumptions on the policy classes. Additionally, we establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback (RLHF). In this respect, we provide theoretical evidence of the benefits of KL-regularization for RLHF in tabular and linear MDPs. Interestingly, we avoid pessimism injection by employing computationally feasible regularization to handle reward estimation uncertainty, which sets our approach apart from prior work.
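To make the idea of KL-regularization toward a behavior-cloned policy concrete, the following is a minimal tabular sketch, not the algorithm analyzed in the paper. It assumes a known transition model `P` and reward table `R` (the paper instead studies learning from samples), a stationary behavior-cloning policy estimated by smoothed counts, and a hypothetical regularization weight `lam`. Under these assumptions the regularized objective $\max_\pi V^\pi - \lambda\,\mathrm{KL}(\pi \,\|\, \pi^{\mathrm{BC}})$ can be solved by backward induction with the closed form $\pi_h(a|s) \propto \pi^{\mathrm{BC}}(a|s)\exp(Q_h(s,a)/\lambda)$. All function and parameter names are illustrative.

```python
import math
from collections import Counter


def behavior_cloning(demos, n_states, n_actions, smoothing=1.0):
    """Estimate a stationary policy from (state, action) pairs.

    Uses additively smoothed counts so every action keeps positive
    probability, which keeps the KL term below finite.
    """
    counts = Counter(demos)
    policy = []
    for s in range(n_states):
        row = [counts[(s, a)] + smoothing for a in range(n_actions)]
        total = sum(row)
        policy.append([c / total for c in row])
    return policy


def kl_regularized_planning(P, R, H, pi_bc, lam):
    """Backward induction for max_pi V^pi - lam * KL(pi || pi_bc).

    Assumes a known model: P[s][a] is a list of next-state probabilities
    and R[s][a] is a reward. Returns (policies, values), where
    policies[h][s][a] is the step-h policy and values[s] is the
    regularized value at step 0.
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    policies = []
    for _ in range(H):  # iterate from the last step back to step 0
        new_V, pi_h = [], []
        for s in range(n_states):
            q = [
                R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
                for a in range(n_actions)
            ]
            m = max(q)  # shift by the max for numerical stability
            w = [pi_bc[s][a] * math.exp((q[a] - m) / lam) for a in range(n_actions)]
            z = sum(w)
            pi_h.append([x / z for x in w])
            # lam * log sum_a pi_bc(a|s) * exp(q(a) / lam)
            new_V.append(m + lam * math.log(z))
        V = new_V
        policies.append(pi_h)
    policies.reverse()  # policies were built last step first
    return policies, V
```

In this sketch a large `lam` keeps the returned policy close to the behavior-cloned one, while a small `lam` lets it approach the greedy policy of the unregularized problem. Intuitively, a more accurate behavior-cloning policy leaves less for the RL stage to correct.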