The classical algorithms used in tabular reinforcement learning (Value Iteration and Policy Iteration) have been shown to converge linearly with a rate given by the discount factor $\gamma$ of a discounted Markov Decision Process. Recently, there has been increased interest in gradient-based methods. In this work, we show that the dimension-free linear $\gamma$-rate of classical reinforcement learning algorithms can be achieved by a general family of unregularised Policy Mirror Descent (PMD) algorithms under an adaptive step-size. We also provide a matching worst-case lower bound that demonstrates that the $\gamma$-rate is optimal for PMD methods. Our work offers a novel perspective on the convergence of PMD. We avoid the use of the performance difference lemma beyond establishing the monotonic improvement of the iterates, which leads to a simple analysis that may be of independent interest. We also extend our analysis to the inexact setting and establish the first dimension-free $\varepsilon$-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
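To make the classical benchmark concrete, the following is a minimal sketch of the linear $\gamma$-rate for Value Iteration only; it does not implement PMD or the adaptive step-size studied here. It relies on the standard fact that the Bellman optimality operator is a $\gamma$-contraction in the sup-norm. The random MDP, the function names (`bellman_optimality`, `random_mdp`, `value_iteration_errors`) and the use of a long Value Iteration run as a stand-in for $V^\star$ are all assumptions made for illustration.

```python
import random


def bellman_optimality(V, P, R, gamma):
    """Apply the Bellman optimality operator to the value vector V.

    P[s][a] is a list of next-state probabilities and R[s][a] is the reward
    for taking action a in state s (hypothetical tabular layout).
    """
    return [
        max(
            R[s][a] + gamma * sum(p * V[t] for t, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        )
        for s in range(len(P))
    ]


def sup_dist(U, V):
    """Sup-norm distance between two value vectors."""
    return max(abs(u - v) for u, v in zip(U, V))


def random_mdp(n_states, n_actions, rng):
    """Draw a random tabular MDP with rewards in [0, 1] (illustration only)."""
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-9 for _ in range(n_states)]
            total = sum(weights)
            P_s.append([w / total for w in weights])  # normalise to a distribution
            R_s.append(rng.random())
        P.append(P_s)
        R.append(R_s)
    return P, R


def value_iteration_errors(P, R, gamma, n_iters, n_ref_iters=2000):
    """Return the sup-norm errors ||V_k - V*|| for k = 0, ..., n_iters.

    V* is approximated by running Value Iteration for n_ref_iters steps,
    which is an assumption of this sketch rather than an exact solve.
    """
    V_star = [0.0] * len(P)
    for _ in range(n_ref_iters):
        V_star = bellman_optimality(V_star, P, R, gamma)

    V = [0.0] * len(P)
    errors = [sup_dist(V, V_star)]
    for _ in range(n_iters):
        V = bellman_optimality(V, P, R, gamma)
        errors.append(sup_dist(V, V_star))
    return errors


if __name__ == "__main__":
    rng = random.Random(0)
    P, R = random_mdp(n_states=4, n_actions=3, rng=rng)
    gamma = 0.9
    errors = value_iteration_errors(P, R, gamma, n_iters=10)
    for k, err in enumerate(errors):
        # Each error should sit at or below gamma**k times the initial error.
        print(k, err, gamma**k * errors[0])
```

The bound checked in the loop depends only on $\gamma$ and the initial error, not on the number of states or actions, which is the sense in which the rate is dimension-free.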