Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.
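To make the setting concrete, the following is a minimal sketch of one entropy-regularized NPG iteration for a log-linear (softmax) policy with linear function approximation, in the style of Q-NPG. It is an illustration under stated assumptions, not the algorithm analyzed in the paper: the soft Q-function is computed by exact policy evaluation rather than by a sample-based critic, the least-squares fit weights all state-action pairs uniformly rather than by a visitation distribution, the averaging step mentioned in the abstract is omitted, and the update form $\theta \leftarrow (1-\eta\lambda)\theta + \eta w$ is chosen because it reduces to the familiar tabular update $\pi_{t+1} \propto \pi_t^{1-\eta\lambda}\exp(\eta Q_\lambda^{\pi_t})$ when the features are one-hot. All function names, the random MDP, and the hyperparameters are hypothetical.

```python
import numpy as np

def log_softmax_policy(theta, phi):
    """Log-probabilities of pi_theta(a|s) proportional to exp(theta^T phi(s, a)).

    phi has shape (S, A, d); the result has shape (S, A).
    """
    logits = phi @ theta
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def soft_q(log_pi, P, r, gamma, lam):
    """Exact entropy-regularized Q-function of the policy given by log_pi.

    Assumption: the model (P, r) is known, so V solves a linear system.
    P has shape (S, A, S); r has shape (S, A).
    """
    pi = np.exp(log_pi)
    S = r.shape[0]
    # Regularized one-step reward: E_a[r(s, a) - lam * log pi(a|s)].
    r_pi = (pi * (r - lam * log_pi)).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * (P @ V)

def npg_step(theta, phi, P, r, gamma, lam, eta):
    """One update: fit Q linearly in phi, then move theta toward the fit."""
    log_pi = log_softmax_policy(theta, phi)
    Q = soft_q(log_pi, P, r, gamma, lam)
    d = phi.shape[-1]
    # Assumption: unweighted least squares over all (s, a) pairs.
    w, *_ = np.linalg.lstsq(phi.reshape(-1, d), Q.reshape(-1), rcond=None)
    return (1.0 - eta * lam) * theta + eta * w

def random_mdp(S, A, rng):
    """Hypothetical random MDP with rewards in [0, 1)."""
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    r = rng.random((S, A))
    return P, r

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, d = 5, 3, 4
    P, r = random_mdp(S, A, rng)
    phi = rng.standard_normal((S, A, d))  # d < S * A: approximation error is possible
    theta = np.zeros(d)
    for _ in range(100):
        theta = npg_step(theta, phi, P, r, gamma=0.9, lam=0.1, eta=1.0)
    print("final parameters:", theta)
```

With one-hot features the linear fit is exact, so any fixed point of this update satisfies $\theta = Q_\lambda^{\pi_\theta}/\lambda$, which is the soft optimality condition $\pi(a|s) \propto \exp(Q_\lambda^{\pi}(s,a)/\lambda)$. With fewer features than state-action pairs, the fit generally leaves a residual, which corresponds to the function approximation error that appears in the stated rates.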