We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain a tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps, and $\gamma_{T/H}$ is the maximum information gain. With this result, we resolve the limitations of prior theoretical work on PSRL and provide the theoretical foundation and tools for analyzing PSRL in complex settings.
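As a concrete illustration of the algorithm being analyzed, the following is a minimal Python sketch of the generic GP-PSRL episode loop: sample one transition function from the GP posterior, plan against the sample, execute the resulting policy for $H$ steps, and condition the posterior on the observed transitions. The toy 1-D system, the discrete action set, the one-step greedy planner, and all names (`rbf`, `posterior`, `true_step`) are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, ls=0.5):
    # Squared-exponential kernel between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def posterior(X, y, Xs, noise=1e-2):
    # GP posterior mean and covariance at test inputs Xs given data (X, y).
    Kss = rbf(Xs, Xs)
    if len(X) == 0:
        return np.zeros(len(Xs)), Kss  # no data yet: posterior = prior
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - V.T @ V

actions = np.array([-1.0, 0.0, 1.0])
s_grid = np.linspace(-3.0, 3.0, 31)
grid = np.array([[s, a] for s in s_grid for a in actions])  # (state, action) inputs

def true_step(s, a):
    # The unknown dynamics the agent must learn (a toy linear system).
    return 0.9 * s + 0.3 * a + 0.05 * rng.standard_normal()

def reward(s):
    return -s ** 2  # drive the state toward the origin

X, y = np.empty((0, 2)), np.empty(0)  # observed transitions
H, n_episodes = 10, 20
for ep in range(n_episodes):
    # Step 1 (posterior sampling): draw one joint sample of the transition
    # function over the whole grid from the current GP posterior.
    mu, cov = posterior(X, y, grid)
    jitter = 1e-6 * np.eye(len(grid))
    f = mu + np.linalg.cholesky(cov + jitter) @ rng.standard_normal(len(grid))
    f = f.reshape(len(s_grid), len(actions))  # f[i, j]: sampled next state

    s = 2.0  # fixed initial state each episode
    for _ in range(H):
        # Step 2 (planning): act greedily under the *sampled* model; a
        # one-step lookahead stands in here for full H-step planning.
        i = np.abs(s_grid - s).argmin()
        a = actions[reward(f[i]).argmax()]
        # Step 3 (execution + update): act in the real system, record data.
        s_next = true_step(s, a)
        X = np.vstack([X, [s, a]])
        y = np.append(y, s_next)
        s = s_next
    print(f"episode {ep:2d}: data size {len(y):3d}, final |s| = {abs(s):.3f}")
```

The design point worth noting is that exploration comes entirely from the randomness of the posterior sample rather than from an explicit optimism bonus, which is precisely what a Bayesian regret analysis of PSRL exploits.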