We study a primal-dual reinforcement learning (RL) algorithm for the online constrained Markov decision process (CMDP) problem, in which the agent explores to find an optimal policy that maximizes return while satisfying constraints. Although primal-dual RL algorithms are widely used in practice, the existing theoretical literature on them for this problem provides only sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient primal-dual algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this is the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, whereas an existing algorithm exhibits oscillatory performance and constraint violation.