We study a primal-dual reinforcement learning (RL) algorithm for the online constrained Markov decision process (CMDP) problem, in which the agent explores the environment to find an optimal policy that maximizes return while satisfying constraints. Although primal-dual RL algorithms are widely used in practice, the existing theoretical literature on them for this problem provides only sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient primal-dual algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this is the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, whereas an existing algorithm exhibits oscillatory performance and constraint violation.
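For orientation, the generic primal-dual template that the abstract refers to can be sketched as follows. This is the textbook Lagrangian formulation for a CMDP with a single constraint, not necessarily the exact objective or update used in the paper. All symbols are assumptions introduced here for illustration: $V_r^{\pi}$ and $V_c^{\pi}$ are the expected return and expected constraint value of policy $\pi$, $b$ is the constraint threshold, $\lambda$ is the dual variable, $\theta_k$ are the policy parameters at iteration $k$, and $\eta_\theta$, $\eta_\lambda$ are step sizes.

```latex
% Constrained problem (single constraint, illustrative notation)
\max_{\pi} \; V_r^{\pi}
\quad \text{subject to} \quad V_c^{\pi} \ge b

% Lagrangian and the associated saddle-point problem
L(\pi, \lambda) = V_r^{\pi} + \lambda \left( V_c^{\pi} - b \right),
\qquad
\max_{\pi} \, \min_{\lambda \ge 0} \, L(\pi, \lambda)

% Generic primal-dual iteration:
% gradient ascent on the policy parameters, projected gradient descent on the dual variable
\theta_{k+1} = \theta_k + \eta_\theta \, \nabla_\theta L(\pi_{\theta_k}, \lambda_k),
\qquad
\lambda_{k+1} = \max\!\left\{ 0, \; \lambda_k - \eta_\lambda \left( V_c^{\pi_{\theta_k}} - b \right) \right\}
```

In this template the dual variable grows while the constraint is violated and shrinks once it is satisfied. Plain iterations of this kind can cycle around the saddle point rather than settle on it, which is one way to read the oscillatory behavior the abstract reports for an existing algorithm.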