We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity using a simple, non-adaptive, geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with an arbitrary constant step size.
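To make the setting concrete, the following is a minimal sketch of a Q-NPG-style iteration with a log-linear policy and a geometrically increasing step size. It is an illustration under stated assumptions, not the method as analyzed in the paper: the Q-function is computed exactly on a small tabular MDP rather than estimated from samples, the regression onto the features is unweighted least squares rather than a regression under a state-action visitation distribution, rewards are maximized (so the update is $\theta \leftarrow \theta + \eta_k w$), and the schedule $\eta_{k+1} = \eta_k / \gamma$ is one assumed instance of a geometrically increasing step size. The names `random_mdp`, `softmax_policy`, `q_values`, and `q_npg` are hypothetical helpers introduced here.

```python
import numpy as np


def random_mdp(S, A, seed=0):
    """Hypothetical helper: a random tabular MDP with rewards in [0, 1]."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)          # each P[s, a] is a distribution
    r = rng.random((S, A))
    return P, r


def softmax_policy(theta, phi):
    """Log-linear policy: pi(a|s) proportional to exp(phi[s, a] @ theta)."""
    logits = phi @ theta                        # shape (S, A)
    logits -= logits.max(axis=1, keepdims=True) # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def q_values(P, r, pi, gamma):
    """Exact Q^pi (assumption: stands in for a sample-based estimate)."""
    S, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)       # state transitions under pi
    r_pi = (pi * r).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v                    # shape (S, A)


def q_npg(P, r, phi, gamma, eta0=1.0, iters=30):
    """Q-NPG-style loop with an assumed schedule eta_{k+1} = eta_k / gamma."""
    S, A, d = phi.shape
    X = phi.reshape(S * A, d)                   # feature matrix, one row per (s, a)
    theta = np.zeros(d)
    eta = eta0
    values = []
    for _ in range(iters):
        pi = softmax_policy(theta, phi)
        Q = q_values(P, r, pi, gamma)
        # Value of the current policy, averaged over a uniform start state.
        values.append((pi * Q).sum(axis=1).mean())
        # Compatible function approximation step, simplified to an
        # unweighted least-squares fit of Q onto the features.
        w, *_ = np.linalg.lstsq(X, Q.reshape(-1), rcond=None)
        theta = theta + eta * w                 # reward-maximization sign convention
        eta = eta / gamma                       # geometrically increasing step size
    return theta, values


if __name__ == "__main__":
    S, A, gamma = 4, 3, 0.9
    P, r = random_mdp(S, A)
    phi = np.eye(S * A).reshape(S, A, S * A)    # one-hot features (tabular case)
    theta, values = q_npg(P, r, phi, gamma)
    print(values[0], values[-1])
```

With one-hot features the least-squares fit recovers the Q-function exactly, so the loop reduces to a tabular softmax NPG update, which is the simplest case for checking that the recorded values do not decrease. Replacing `phi` with a lower-dimensional feature map introduces approximation error, which is where the inexact PMD view described in the abstract applies.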