Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization poses significant challenges to understanding the global convergence of policy gradient methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop a framework that provides a set of easily verifiable assumptions ensuring that the policy optimization objective satisfies the Kurdyka-Łojasiewicz (KL) condition. Leveraging the KL condition, we show that policy gradient methods converge to the globally optimal policy at a non-asymptotic rate despite nonconvexity. Our results apply to various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which we show that stochastic policy gradient methods obtain an $\epsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-1})$ that is polynomial in the planning horizon. Our results establish the first sample complexity bounds in the literature for multi-period inventory systems with Markov-modulated demands and for stochastic cash balance problems.
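For context, a common instance of a KL-type inequality in policy optimization is gradient dominance; the form below is an illustrative standard statement, not necessarily the exact condition verified in this paper, and the symbols $J$, $\theta$, and $\mu$ are generic placeholders. For a cost objective $J(\theta)$ with global minimizer $\theta^\star$, it reads
\[
J(\theta) - J(\theta^\star) \;\le\; \frac{1}{2\mu}\,\bigl\|\nabla J(\theta)\bigr\|^{2} \qquad \text{for some } \mu > 0,
\]
which corresponds to the KL condition with exponent $1/2$ and, combined with smoothness of $J$, yields a non-asymptotic convergence rate for gradient methods even without convexity.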