A popular perspective in reinforcement learning (RL) casts the problem as probabilistic inference on a graphical model of the Markov decision process (MDP). The core object of study is the probability of each state-action pair being visited under the optimal policy. Previous approaches to approximating this quantity can be arbitrarily poor, leading to algorithms that do not implement genuine statistical inference and consequently do not perform well on challenging problems. In this work, we undertake a rigorous Bayesian treatment of the posterior probability of state-action optimality and clarify how it flows through the MDP. We first show that this quantity can indeed be used to generate a policy that explores efficiently, as measured by regret. Unfortunately, computing it is intractable, so we derive a new variational Bayesian approximation that yields a tractable convex optimization problem, and we establish that the resulting policy also explores efficiently. We call our approach VAPOR and show that it has strong connections to Thompson sampling, K-learning, and maximum entropy exploration. We conclude with experiments demonstrating the performance advantage of a deep RL version of VAPOR.
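To make the central quantity concrete, the following minimal sketch estimates the posterior probability of optimality in the simplest special case: a single-state MDP (a Bernoulli bandit) with independent Beta posteriors over each arm's mean reward. This is an illustration only and not the VAPOR algorithm; the function name `posterior_optimality_probs`, the uniform Beta(1, 1) prior, and the example counts are all assumptions introduced here. Sampling an action from the returned distribution is equivalent to Thompson sampling in this setting, which hints at the connection mentioned above.

```python
import random


def posterior_optimality_probs(successes, failures, num_samples=20000, seed=0):
    """Monte Carlo estimate of P(arm is optimal | data) for a Bernoulli bandit.

    Assumes (for illustration) an independent Beta(1, 1) prior on each arm's
    mean reward, so the posterior for arm i is Beta(1 + s_i, 1 + f_i).
    """
    rng = random.Random(seed)
    num_arms = len(successes)
    wins = [0] * num_arms
    for _ in range(num_samples):
        # Draw one plausible mean reward per arm from its posterior.
        draws = [
            rng.betavariate(1 + s, 1 + f)
            for s, f in zip(successes, failures)
        ]
        # The arm with the largest sampled mean is optimal in this sample.
        best = max(range(num_arms), key=draws.__getitem__)
        wins[best] += 1
    # Fraction of posterior samples in which each arm was optimal.
    return [w / num_samples for w in wins]


# Hypothetical observation counts for three arms.
probs = posterior_optimality_probs(successes=[8, 5, 1], failures=[2, 5, 1])
```

In a multi-state MDP the analogous quantity must also account for how optimality propagates through the transition dynamics, which is what makes exact computation intractable and motivates a variational approximation.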