A wide variety of queueing systems can be naturally modeled as infinite-state Markov Decision Processes (MDPs). In the reinforcement learning (RL) context, a variety of algorithms have been developed to learn and optimize these MDPs. At the heart of many popular policy-gradient based learning algorithms, such as natural actor-critic, TRPO, and PPO, lies the Natural Policy Gradient (NPG) policy optimization algorithm. Convergence results for these RL algorithms rest on convergence results for the NPG algorithm. However, all existing results on the convergence of the NPG algorithm are limited to finite-state settings. We study a general class of queueing MDPs, and prove an $O(1/\sqrt{T})$ convergence rate for the NPG algorithm when it is initialized with the MaxWeight policy. This is the first convergence rate bound for the NPG algorithm for a general class of infinite-state average-reward MDPs. Moreover, our result applies beyond the queueing setting to any countably-infinite MDP satisfying certain mild structural assumptions, given a sufficiently good initial policy. Key to our result are state-dependent bounds on the relative value function achieved by the iterate policies of the NPG algorithm.
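For concreteness, the following is a minimal sketch of the two objects named above, written in standard notation that is not taken from this abstract: the softmax NPG update with step size $\eta$ and relative action-value function $Q^{\pi_t}$, and one common form of the MaxWeight policy for a scheduling MDP with queue lengths $q_i$, service rates $\mu_i$, and feasible schedules $a \in \mathcal{A}$.
\[
\pi_{t+1}(a \mid s) \;=\; \pi_t(a \mid s)\,\frac{\exp\!\big(\eta\, Q^{\pi_t}(s,a)\big)}{\sum_{a'} \pi_t(a' \mid s)\,\exp\!\big(\eta\, Q^{\pi_t}(s,a')\big)},
\qquad
\pi_{\mathrm{MW}}(q) \;\in\; \operatorname*{arg\,max}_{a \in \mathcal{A}} \sum_i \mu_i\, q_i\, a_i .
\]
The left-hand update is the standard tabular softmax form of NPG; the right-hand rule is the usual queue-length-weighted greedy choice that defines MaxWeight, here serving as the initial policy $\pi_0$.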