Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and fundamental methods in reinforcement learning. Motivated by the instability of policy iteration (PI) with inexact policy evaluation, unregularised PMD algorithmically regularises the policy improvement step of PI without regularising the objective function. With exact policy evaluation, PI is known to converge linearly with a rate given by the discount factor $\gamma$ of a Markov Decision Process. In this work, we bridge the gap between PI and PMD with exact policy evaluation and show that the dimension-free $\gamma$-rate of PI can be achieved by the general family of unregularised PMD algorithms under an adaptive step-size. We show that both the rate and step-size are unimprovable for PMD: we provide matching lower bounds that demonstrate that the $\gamma$-rate is optimal for PMD methods as well as PI and that the adaptive step-size is necessary to achieve it. Our work is the first to relate PMD to rate-optimality and step-size necessity. Our study of the convergence of PMD avoids the use of the performance difference lemma, which leads to a direct analysis of independent interest. We also extend the analysis to the inexact setting and establish the first dimension-optimal sample complexity for unregularised PMD under a generative model, improving upon the best-known result.
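To make the comparison between PI and PMD concrete, the following is a minimal sketch on a small random tabular MDP with exact policy evaluation. It is not the construction analysed in this work: the helper names (`random_mdp`, `evaluate`, `optimal_value`, `run`) are hypothetical, the MDP is randomly generated, the KL divergence is only one choice of mirror map (it gives the familiar multiplicative-weights update $\pi_{k+1}(a|s) \propto \pi_k(a|s)\exp(\eta Q^{\pi_k}(s,a))$), and the step-size `eta` is held fixed for simplicity rather than chosen by the adaptive rule referred to above.

```python
import numpy as np


def random_mdp(num_states, num_actions, seed=0):
    """Hypothetical helper: a random tabular MDP with rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)  # each P[s, a] is a distribution over next states
    r = rng.random((num_states, num_actions))
    return P, r


def evaluate(P, r, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi, then form Q."""
    num_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)          # expected one-step reward under pi
    V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)
    return V, Q


def optimal_value(P, r, gamma, iters=2000):
    """Reference V* obtained by running value iteration to numerical convergence."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(r + gamma * (P @ V), axis=1)
    return V


def run(P, r, gamma, steps, eta=None):
    """Return the gaps ||V* - V^{pi_k}||_inf, starting from the uniform policy.

    eta=None  -> policy iteration (greedy improvement step).
    eta=float -> PMD with the KL mirror map and a FIXED step-size (an
                 illustrative simplification, not an adaptive schedule).
    """
    num_states, num_actions = r.shape
    V_star = optimal_value(P, r, gamma)
    logits = np.zeros((num_states, num_actions))
    pi = np.full((num_states, num_actions), 1.0 / num_actions)
    gaps = []
    for _ in range(steps):
        V, Q = evaluate(P, r, pi, gamma)
        gaps.append(np.max(np.abs(V_star - V)))
        if eta is None:
            pi = np.eye(num_actions)[np.argmax(Q, axis=1)]  # greedy policy
        else:
            logits = logits + eta * Q  # log-space form of pi * exp(eta * Q)
            z = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
            pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return np.array(gaps)


if __name__ == "__main__":
    gamma = 0.9
    P, r = random_mdp(num_states=5, num_actions=3, seed=0)
    print("PI gaps: ", run(P, r, gamma, steps=10))
    print("PMD gaps:", run(P, r, gamma, steps=10, eta=1.0))
```

In this sketch the PI gaps contract by at least a factor of $\gamma$ per iteration, as the known linear rate predicts. The fixed-step PMD iterates improve monotonically, but the sketch makes no claim that they attain the $\gamma$-rate. Increasing `eta` moves the KL update closer to the greedy step of PI.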