We analyze the Bayesian regret of the Gaussian process posterior sampling reinforcement learning (GP-PSRL) algorithm. Posterior sampling is an effective heuristic for decision-making under uncertainty that has been used to develop successful algorithms for a variety of continuous control problems. However, theoretical work on GP-PSRL is limited. All known regret bounds either fail to achieve a tight dependence on a kernel-dependent quantity called the maximum information gain or fail to properly account for the fact that the set of possible system states is unbounded. Through a recursive application of the Borell-Tsirelson-Ibragimov-Sudakov inequality, we show that, with high probability, the states actually visited by the algorithm are contained within a ball of near-constant radius. To obtain a tight dependence on the maximum information gain, we use the chaining method to control the regret suffered by GP-PSRL. Our main result is a Bayesian regret bound of order $\widetilde{\mathcal{O}}(H^{3/2}\sqrt{\gamma_{T/H} T})$, where $H$ is the horizon, $T$ is the number of time steps, and $\gamma_{T/H}$ is the maximum information gain. With this result, we resolve the limitations of prior theoretical work on PSRL and provide the theoretical foundation and tools for analyzing PSRL in complex settings.
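As a concrete illustration of the algorithm being analyzed, the following is a minimal Python sketch of the generic GP-PSRL episode loop: sample one transition function from the GP posterior, plan against the sample, execute the resulting policy for $H$ steps, and condition the posterior on the observed transitions. The toy 1-D system, the discrete action set, the one-step greedy planner, and all names (`rbf`, `posterior`, `true_step`) are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Y, ls=0.5):
    # Squared-exponential kernel between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def posterior(X, y, Xs, noise=1e-2):
    # GP posterior mean and covariance at test inputs Xs given data (X, y).
    Kss = rbf(Xs, Xs)
    if len(X) == 0:
        return np.zeros(len(Xs)), Kss  # no data yet: posterior = prior
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    V = np.linalg.solve(L, Ks)
    return Ks.T @ alpha, Kss - V.T @ V

actions = np.array([-1.0, 0.0, 1.0])
s_grid = np.linspace(-3.0, 3.0, 31)
grid = np.array([[s, a] for s in s_grid for a in actions])  # (state, action) inputs

def true_step(s, a):
    # The unknown dynamics the agent must learn (a toy linear system).
    return 0.9 * s + 0.3 * a + 0.05 * rng.standard_normal()

def reward(s):
    return -s ** 2  # drive the state toward the origin

X, y = np.empty((0, 2)), np.empty(0)  # observed transitions
H, n_episodes = 10, 20
for ep in range(n_episodes):
    # Step 1 (posterior sampling): draw one joint sample of the transition
    # function over the whole grid from the current GP posterior.
    mu, cov = posterior(X, y, grid)
    jitter = 1e-6 * np.eye(len(grid))
    f = mu + np.linalg.cholesky(cov + jitter) @ rng.standard_normal(len(grid))
    f = f.reshape(len(s_grid), len(actions))  # f[i, j]: sampled next state

    s = 2.0  # fixed initial state each episode
    for _ in range(H):
        # Step 2 (planning): act greedily under the *sampled* model; a
        # one-step lookahead stands in here for full H-step planning.
        i = np.abs(s_grid - s).argmin()
        a = actions[reward(f[i]).argmax()]
        # Step 3 (execution + update): act in the real system, record data.
        s_next = true_step(s, a)
        X = np.vstack([X, [s, a]])
        y = np.append(y, s_next)
        s = s_next
    print(f"episode {ep:2d}: data size {len(y):3d}, final |s| = {abs(s):.3f}")
```

The design point worth noting is that exploration comes entirely from the randomness of the posterior sample rather than from an explicit optimism bonus, which is precisely what a Bayesian regret analysis of PSRL exploits.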