We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximately correct (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.
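For readers unfamiliar with the PD approach, the following is a minimal sketch of a generic Lagrangian primal-dual policy-gradient loop on a toy tabular CMDP. It is not the paper's Uniform-PAC algorithm (it is closer to the vanilla baseline whose oscillatory behavior the abstract mentions), and the 2-state CMDP, cost budget `b`, and step sizes `eta_theta`, `eta_lam` are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): maximize
# discounted reward subject to a discounted-cost budget V_C(pi) <= b,
# via policy-gradient ascent on the Lagrangian and projected dual ascent.
import numpy as np

nS, nA, gamma = 2, 2, 0.9                        # toy CMDP (assumption)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] = next-state dist.
R = rng.uniform(size=(nS, nA))                   # reward signal
C = rng.uniform(size=(nS, nA))                   # cost signal
b = 3.0                                          # cost budget (assumption)
mu = np.full(nS, 1.0 / nS)                       # initial-state distribution

def policy(theta):
    """Per-state softmax policy."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(pi, U):
    """Exact discounted value of scalar signal U under policy pi."""
    P_pi = np.einsum("sa,san->sn", pi, P)
    u_pi = (pi * U).sum(axis=1)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, u_pi)
    return mu @ V, V

theta, lam = np.zeros((nS, nA)), 0.0
eta_theta, eta_lam = 0.5, 0.05                   # step sizes (assumption)
for t in range(2000):
    pi = policy(theta)
    L = R - lam * C                              # Lagrangian reward signal
    _, VL = value(pi, L)
    Q = L + gamma * P @ VL                       # Q-values of the Lagrangian
    # Discounted state-occupancy measure d_pi (policy gradient theorem).
    P_pi = np.einsum("sa,san->sn", pi, P)
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    # Exact softmax policy gradient of the Lagrangian: d(s) * pi(a|s) * A(s,a).
    A = Q - (pi * Q).sum(axis=1, keepdims=True)
    theta += eta_theta * d[:, None] * pi * A     # primal ascent step
    Jc, _ = value(pi, C)
    lam = max(0.0, lam + eta_lam * (Jc - b))     # projected dual ascent step

pi = policy(theta)
print("reward value:", value(pi, R)[0], "cost value:", value(pi, C)[0], "lambda:", lam)
```

Because both players take gradient steps on the same Lagrangian, the dual variable `lam` can keep rising while the constraint is violated and falling once it is satisfied; this tug-of-war is the source of the oscillation and constraint violation that the abstract attributes to baseline PD methods.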