Bayesian reinforcement learning (BRL) combines principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Like other model-based RL approaches, it involves two key components: (1) inferring the posterior distribution of the data-generating process (DGP) modeling the true environment, and (2) learning a policy using the inferred posterior. We propose to model the dynamics of the unknown environment with deep generative models, assuming Markov dependence. In the absence of likelihood functions for these models, we train them by targeting a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution and, to achieve scalability in the high-dimensional parameter space of the neural networks, employ gradient-based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior, we prove a Bernstein-von Mises-type theorem. For policy learning, we propose expected Thompson sampling (ETS), which learns the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves on traditional Thompson sampling (TS) and its extensions, which use only a single sample drawn from the posterior. We study this improvement both theoretically and in simulation studies with discrete state and action spaces. Finally, we successfully extend our setup to a challenging problem with a continuous action space, albeit without theoretical guarantees.
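To make the prequential scoring rule target concrete, the sketch below estimates a per-step score with the energy score (one common likelihood-free choice; the paper's specific SR may differ) and accumulates it along a Markov trajectory. The helper `simulate_step` and the `(s_prev, a_prev, s_next)` trajectory format are hypothetical placeholders, assuming a simulator-style deep generative model of the transition dynamics:

```python
import numpy as np

def energy_score(samples, y, beta=1.0):
    """Unbiased Monte Carlo estimate of the (negatively oriented) energy score
    S(P, y) = E||X - y||^beta - 0.5 * E||X - X'||^beta,
    from m i.i.d. simulator draws `samples` (m x d) and an observation y (d,).
    Lower values indicate a better predictive distribution."""
    m = samples.shape[0]
    term1 = np.mean(np.linalg.norm(samples - y, axis=1) ** beta)
    diffs = samples[:, None, :] - samples[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1) ** beta
    term2 = pairwise.sum() / (m * (m - 1))  # diagonal terms are zero
    return term1 - 0.5 * term2

def prequential_loss(simulate_step, theta, trajectory, m=50, rng=None):
    """Cumulative prequential score: at each step, score the one-step predictive
    P_theta(. | s_prev, a_prev) against the next observed state s_next.
    The generalized posterior is then pi(theta) * exp(-w * prequential_loss)."""
    rng = np.random.default_rng() if rng is None else rng
    total = 0.0
    for s_prev, a_prev, s_next in trajectory:
        sims = np.stack([simulate_step(theta, s_prev, a_prev, rng)
                         for _ in range(m)])
        total += energy_score(sims, np.asarray(s_next))
    return total
```

The Markov assumption shows up in the loop: each predictive distribution conditions only on the previous state-action pair, so the loss decomposes additively over time steps.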
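Expected Thompson sampling admits a similarly brief sketch: rather than acting on a single posterior draw as in TS, average the action-value function over many posterior samples and act greedily on the average. Here `q_value(theta, state, a)` is a hypothetical helper returning the optimal action value under the model with parameters `theta` (e.g., obtained by running value iteration on that sampled model), assuming the discrete action setting:

```python
import numpy as np

def expected_thompson_action(posterior_thetas, q_value, state, actions):
    """ETS action selection: average Q over posterior draws, then act greedily.
    Classical TS instead draws one theta and maximizes q_value(theta, state, .),
    so ETS replaces a single-sample plug-in with a posterior expectation."""
    avg_q = {a: np.mean([q_value(theta, state, a) for theta in posterior_thetas])
             for a in actions}
    return max(avg_q, key=avg_q.get)
```

Since the posterior samples come from the SMC sampler described above, the Monte Carlo average approximates the expected value function with respect to the generalized posterior, which is the objective ETS maximizes.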