We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs and develop an algorithm, PZRL, that discretizes the state-action space adaptively and zooms in to the promising regions of the "policy space" that seem to yield high average rewards. We show that the regret of PZRL can be bounded as $\tilde{\mathcal{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = 2d_\mathcal{S} + d^\Phi_z + 2$, $d_\mathcal{S}$ is the dimension of the state space, and $d^\Phi_z$ is the zooming dimension. $d^\Phi_z$ is a problem-dependent quantity that depends not only on the underlying MDP but also on the class of policies $\Phi$ used by the agent; this allows us to conclude that if the agent knows a priori that the optimal policy belongs to a low-complexity class (one with small $d^\Phi_z$), then its regret will be small. The current work shows how to capture adaptivity gains for infinite-horizon average-reward RL in terms of $d^\Phi_z$. We note that pre-existing notions of zooming dimension are adequate only for the episodic RL case, since under them the zooming dimension approaches the covering dimension of the state-action space as $T\to\infty$, and hence they do not yield any adaptivity gains. Several experiments are conducted to evaluate the performance of PZRL. PZRL outperforms other state-of-the-art algorithms; this clearly demonstrates the gains arising from adaptivity.
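As a purely illustrative instance of the bound (the values $d_\mathcal{S} = 1$, $d^\Phi_z = 1$, and $d^\Phi_z = 3$ below are assumed for concreteness and do not refer to any specific MDP studied here), a one-dimensional state space with a policy class of zooming dimension $d^\Phi_z = 1$ gives
$$d_{\text{eff.}} = 2\cdot 1 + 1 + 2 = 5, \qquad \tilde{\mathcal{O}}\big(T^{1 - 1/5}\big) = \tilde{\mathcal{O}}\big(T^{4/5}\big),$$
whereas the same MDP with a richer policy class of zooming dimension $d^\Phi_z = 3$ would give $d_{\text{eff.}} = 7$ and the weaker bound $\tilde{\mathcal{O}}\big(T^{6/7}\big)$.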