We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). A key shortcoming of existing Thompson sampling algorithms is that they rely on a Gaussian approximation of the posterior distribution, which is a poor surrogate in most practical settings. We instead sample the Q function directly from its posterior distribution using Langevin Monte Carlo, an efficient Markov chain Monte Carlo (MCMC) method. Our method needs only noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes it easy to deploy in deep RL. We provide a rigorous theoretical analysis of the proposed method and show that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{5/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We then apply this approach to deep RL, using the Adam optimizer to perform the gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
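To make the idea of sampling a Q function by noisy gradient descent concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a linear Q function $Q(s,a) = \phi(s,a)^\top w$ fitted with a ridge-regularized squared loss, and it applies a plain Langevin Monte Carlo update: a gradient step plus Gaussian noise scaled by $\sqrt{2\eta/\beta}$. The function names `lmc_sample` and `greedy_action`, the parameters `step_size`, `inv_temp`, and `reg`, and the choice of loss are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def lmc_sample(phi, targets, w_init, step_size=1e-3, inv_temp=1.0,
               n_steps=100, reg=1.0, rng=None):
    """Draw one approximate posterior sample of linear Q-function weights.

    Hypothetical helper. It runs Langevin Monte Carlo on the assumed loss
        L(w) = ||phi @ w - targets||^2 + reg * ||w||^2,
    whose Gibbs distribution exp(-inv_temp * L(w)) plays the role of the
    posterior over w.
    """
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        residual = phi @ w - targets
        grad = 2.0 * (phi.T @ residual + reg * w)   # gradient of L(w)
        noise = rng.standard_normal(w.shape)        # isotropic Gaussian noise
        # Noisy gradient descent: descent step plus scaled Gaussian noise.
        w = w - step_size * grad + np.sqrt(2.0 * step_size / inv_temp) * noise
    return w


def greedy_action(w, action_features):
    """Act greedily under the sampled Q function (Thompson-sampling style).

    action_features has one row of features per candidate action.
    """
    return int(np.argmax(action_features @ w))
```

In this sketch, a smaller `inv_temp` injects more noise and therefore more exploration, while a very large `inv_temp` makes the iterates concentrate near the ridge-regression solution. A deep RL variant would replace the closed-form gradient with backpropagation through a Q network, and the abstract states that the gradient updates are performed with the Adam optimizer. That variant is not shown here.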