In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time-inhomogeneous reinforcement learning problem, where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1$-dimension of the space of environments. We then find concrete bounds on $d_{l_1}$ in a variety of settings, such as tabular, linear, and finite mixtures, and discuss how our results are either the first of their kind or improve the state of the art.
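To make the object of study concrete, the following is a minimal sketch of Thompson Sampling (posterior sampling) in the tabular, time-inhomogeneous setting. It is a generic illustration rather than the algorithm or analysis of this paper: the Dirichlet prior over transitions, the known deterministic rewards in $[0,1]$, the fixed initial state `0`, and all function names are assumptions made for the example. In each episode the agent samples an environment from its posterior, plays the policy that is optimal for the sample, and updates the posterior with the observed transitions.

```python
import numpy as np


def backward_induction(P, R):
    """Optimal policy and initial-step values of a finite-horizon MDP.

    P: (H, S, A, S) step-dependent transition probabilities.
    R: (H, S, A) mean rewards (assumed known in this sketch).
    """
    H, S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R[h] + P[h] @ V          # (S, A) action values at step h
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V


def policy_value(P, R, policy):
    """Initial-step value of a deterministic policy of shape (H, S)."""
    H, S, A, _ = P.shape
    V = np.zeros(S)
    states = np.arange(S)
    for h in reversed(range(H)):
        a = policy[h]
        V = R[h][states, a] + P[h][states, a] @ V
    return V


def thompson_sampling(P_true, R, num_episodes, rng):
    """Run posterior sampling; return cumulative expected regret.

    Assumptions of this sketch: Dirichlet(1, ..., 1) prior on every
    transition row, known rewards, and initial state 0 in every episode.
    """
    H, S, A, _ = P_true.shape
    counts = np.ones((H, S, A, S))   # Dirichlet posterior parameters
    _, V_star = backward_induction(P_true, R)
    regret = 0.0
    for _ in range(num_episodes):
        # Normalized Gamma draws are a Dirichlet sample for every row.
        G = rng.gamma(counts)
        P_sample = G / G.sum(axis=-1, keepdims=True)
        policy, _ = backward_induction(P_sample, R)
        regret += V_star[0] - policy_value(P_true, R, policy)[0]
        # Roll out the sampled-optimal policy in the true environment.
        s = 0
        for h in range(H):
            a = policy[h, s]
            s_next = rng.choice(S, p=P_true[h, s, a])
            counts[h, s, a, s_next] += 1
            s = s_next
    return regret
```

The returned quantity is the regret for one fixed environment `P_true`. The Bayesian regret considered in the abstract additionally takes the expectation over environments drawn from the prior, which could be approximated here by averaging the result over many independently drawn `P_true`.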