In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how our results are either the first of their kind or improve the state-of-the-art.