We consider a constrained Markov decision process (CMDP) in which the goal of an agent is to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm, which guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for general parameterized policies. This improves the state-of-the-art sample complexity for general parameterized CMDPs by a factor of $\mathcal{O}(\epsilon^{-2})$ and achieves the theoretical lower bound.
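To make the primal-dual structure concrete, the following is a minimal, hypothetical sketch (not the paper's PD-ANPG) of a Lagrangian primal-dual update on a toy two-armed bandit with a softmax policy, where a heavy-ball momentum term accelerates the primal step. PD-ANPG itself uses natural policy gradients, general parameterizations, and a specific acceleration schedule; all constants and the bandit instance below are illustrative assumptions.

```python
import numpy as np

# Toy CMDP surrogate: a two-armed bandit with a softmax policy.
# Constraint (matching the abstract's convention): expected cost >= b.
r = np.array([1.0, 0.2])   # per-arm rewards
c = np.array([0.0, 1.0])   # per-arm costs
b = 0.5                    # cost threshold

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.zeros(2)        # policy parameters
lam = 0.0                  # Lagrange multiplier for the cost constraint
m = np.zeros(2)            # momentum buffer
eta, xi, beta = 0.2, 0.2, 0.5  # primal step, dual step, momentum (illustrative)
avg_pi = np.zeros(2)
T = 5000
for t in range(T):
    pi = softmax(theta)
    g = r + lam * c                      # per-arm Lagrangian payoff
    grad = pi * (g - pi @ g)             # softmax policy gradient of pi @ g
    m = beta * m + grad                  # heavy-ball momentum on the primal step
    theta = theta + eta * m              # accelerated primal ascent
    vc = softmax(theta) @ c
    lam = max(0.0, lam - xi * (vc - b))  # dual descent, projected onto lam >= 0
    avg_pi += softmax(theta)
avg_pi /= T                              # averaged policy over the run
avg_vr, avg_vc = avg_pi @ r, avg_pi @ c
```

Because raw primal-dual iterates oscillate around the saddle point, the sketch reports the averaged policy, whose reward and cost values hover near the constrained optimum of this toy instance.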