Despite remarkable successes, deep reinforcement learning algorithms remain sample-inefficient: they require an enormous amount of trial and error to find good policies. Model-based algorithms promise sample efficiency by building an environment model that can be used for planning. Posterior Sampling for Reinforcement Learning (PSRL) is one such model-based algorithm, and it has attracted significant interest due to its performance in the tabular setting. This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of PSRL that retains its model-based essence. PSDRL combines efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling while being competitive with a state-of-the-art (model-based) reinforcement learning method in both sample efficiency and computational efficiency.
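For readers unfamiliar with the tabular algorithm that the abstract builds on, the following is a minimal sketch of classic Posterior Sampling for Reinforcement Learning, not of PSDRL itself. It rests on several assumptions that are not taken from the paper: a Dirichlet posterior over transitions, a Gaussian posterior over mean rewards (standard normal prior, unit-variance noise), a finite-horizon episodic setting with a fixed start state, and hypothetical function names (`sample_mdp`, `solve_finite_horizon`, `psrl`). Each episode samples one plausible model from the posterior, plans in it, acts on the resulting policy, and updates the posterior with the observed data.

```python
import numpy as np


def sample_mdp(trans_counts, reward_sum, reward_n, rng):
    """Draw one plausible MDP from the (assumed) conjugate posterior."""
    S, A, _ = trans_counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            # Dirichlet posterior; counts already include prior pseudo-counts.
            P[s, a] = rng.dirichlet(trans_counts[s, a])
    # Gaussian posterior on the mean reward: N(0, 1) prior, unit-variance noise.
    mean = reward_sum / (reward_n + 1.0)
    std = 1.0 / np.sqrt(reward_n + 1.0)
    R = rng.normal(mean, std)
    return P, R


def solve_finite_horizon(P, R, horizon):
    """Backward induction; returns a time-dependent greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for t in reversed(range(horizon)):
        Q = R + P @ V  # (S, A, S) @ (S,) -> (S, A)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def psrl(true_P, true_R, horizon, episodes, seed=0):
    """Run tabular posterior sampling against a simulated environment."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    trans_counts = np.ones((S, A, S))  # uniform Dirichlet prior
    reward_sum = np.zeros((S, A))
    reward_n = np.zeros((S, A))
    expected_returns = []
    for _ in range(episodes):
        # 1. Sample a model, 2. plan in it, 3. act for one episode.
        P, R = sample_mdp(trans_counts, reward_sum, reward_n, rng)
        policy = solve_finite_horizon(P, R, horizon)
        s, total = 0, 0.0  # assumption: every episode starts in state 0
        for t in range(horizon):
            a = policy[t, s]
            r = true_R[s, a] + rng.normal()  # noisy observed reward
            s_next = rng.choice(S, p=true_P[s, a])
            # 4. Update the posterior statistics with the observed transition.
            trans_counts[s, a, s_next] += 1
            reward_sum[s, a] += r
            reward_n[s, a] += 1
            total += true_R[s, a]  # track the noise-free reward for reporting
            s = s_next
        expected_returns.append(total)
    return np.array(expected_returns)
```

Exploration in this sketch comes entirely from the randomness of the sampled model: state-action pairs with few observations have wide posteriors, so they are occasionally sampled as attractive and get visited. The explicit per-entry posterior and the exact backward induction are what stop scaling beyond small tabular problems, which is the gap the abstract says PSDRL addresses with latent state space models and value-function approximation.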