Reinforcement learning (RL), and deep reinforcement learning (DRL) in particular, have the potential to disrupt, and are already changing, the way we interact with the world. A key indicator of their applicability is their ability to scale to real-world scenarios, that is, to large-scale problems. This scale can be achieved through a combination of factors: the algorithm's ability to exploit large amounts of data and computational resources, and its efficient exploration of the environment for viable solutions (i.e., policies). In this work, we investigate and motivate some theoretical foundations of deep reinforcement learning. We start with exact dynamic programming and work our way up to stochastic approximations, and then to stochastic approximations in the model-free setting, which form the theoretical basis of modern reinforcement learning. We present an overview of this highly varied and rapidly changing field from the perspective of approximate dynamic programming. We then focus our study on the shortcomings, with respect to exploration, of the cornerstone approaches in deep reinforcement learning (i.e., DQN, DDQN, A2C). On the theoretical side, our main contribution is a novel Bayesian actor-critic algorithm. On the empirical side, we evaluate Bayesian exploration and actor-critic algorithms on standard benchmarks as well as state-of-the-art evaluation suites, and we show the benefits of both approaches over current state-of-the-art deep RL methods. We release all of our implementations as an easy-to-install Python library that we hope will serve the reinforcement learning community in a meaningful way and provide a strong foundation for future work.