In this paper, we study efficient online reinforcement learning in the infinite-horizon setting when an offline dataset is available at the start. We assume that the offline dataset is generated by an expert with an unknown level of competence, i.e., the expert is imperfect and does not necessarily follow the optimal policy. We show that a learning agent that models the expert's behavioral policy (parameterized by a competence parameter) achieves substantially lower cumulative regret than one that does not. We establish an upper bound on the regret of the exact informed PSRL (iPSRL) algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite-horizon setting. We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
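To make the idea of a competence-parameterized behavioral policy concrete, the following is a minimal sketch and not the algorithm analyzed in the paper. It assumes, purely for illustration, that the expert chooses actions through a softmax over action values with inverse temperature `beta` (the competence parameter), that the action values in `q_table` are known, and that `beta` takes values on a small finite grid with a uniform prior. The names `expert_policy`, `log_likelihood`, `competence_posterior`, and the toy data are all hypothetical.

```python
import math


def expert_policy(q_values, beta):
    """Softmax over action values; beta >= 0 is the assumed competence parameter.

    beta = 0 gives a uniform policy, and larger beta concentrates on the
    highest-valued action.
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(dataset, q_table, beta):
    """Log-probability of the offline (state, action) pairs under the assumed expert model."""
    return sum(math.log(expert_policy(q_table[s], beta)[a]) for s, a in dataset)


def competence_posterior(dataset, q_table, betas):
    """Posterior over a finite grid of beta values, starting from a uniform prior."""
    logs = [log_likelihood(dataset, q_table, b) for b in betas]
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Toy problem (hypothetical): two states, two actions, five offline transitions.
q_table = {0: [1.0, 0.0], 1: [0.2, 0.8]}
dataset = [(0, 0), (1, 1), (0, 0), (1, 1), (0, 1)]
betas = [0.0, 1.0, 5.0]
posterior = competence_posterior(dataset, q_table, betas)
```

In this toy dataset the expert mostly, but not always, picks the higher-valued action, so the posterior favors an intermediate `beta` over both the uniform policy and the near-greedy one. An informed agent would, in addition, carry uncertainty over the environment itself rather than treating `q_table` as known.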