We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). A key shortcoming of existing Thompson sampling algorithms is their reliance on a Gaussian approximation of the posterior distribution, which is often a poor surrogate in practical settings. We instead sample the Q function directly from its posterior distribution using Langevin Monte Carlo, an efficient type of Markov chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it achieves a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. On several challenging exploration tasks from the Atari57 suite, our approach matches or outperforms state-of-the-art deep RL algorithms.
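The "noisy gradient descent" update at the heart of the abstract is the standard Langevin Monte Carlo step: a gradient step on the negative log-posterior plus Gaussian noise scaled by $\sqrt{2\eta}$. The sketch below is illustrative only, assuming a toy 1-D Gaussian posterior rather than the paper's Q-function posterior; all names and constants are our own.

```python
# Minimal sketch of a Langevin Monte Carlo (LMC) sampler: noisy gradient
# descent whose stationary distribution is (approximately) the posterior.
# Toy target: a 1-D Gaussian posterior N(MU, SIGMA2), an assumption made
# purely for illustration -- not the paper's Q-function setup.
import math
import random

random.seed(0)

MU, SIGMA2 = 2.0, 0.5   # toy posterior N(2, 0.5)
ETA = 0.01              # step size

def grad_potential(theta):
    """Gradient of the potential U(theta) = (theta - MU)^2 / (2 * SIGMA2)."""
    return (theta - MU) / SIGMA2

theta = 0.0
samples = []
for step in range(30000):
    # LMC update: gradient step plus Gaussian noise scaled by sqrt(2 * eta)
    noise = random.gauss(0.0, 1.0)
    theta = theta - ETA * grad_potential(theta) + math.sqrt(2 * ETA) * noise
    if step >= 5000:    # discard burn-in before collecting samples
        samples.append(theta)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

After burn-in, the empirical mean and variance of the chain approach the target posterior's (here, roughly 2.0 and 0.5). In the deep RL setting described above, the same noise-injected update is applied to the Q-network's parameters, with the gradient computed by an optimizer such as Adam.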