We consider a constrained Markov Decision Process (CMDP) in which an agent seeks to maximize the expected discounted sum of rewards over an infinite horizon while ensuring that the expected discounted sum of costs exceeds a certain threshold. Building on the idea of momentum-based acceleration, we develop the Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG) algorithm, which guarantees an $\epsilon$ global optimality gap and $\epsilon$ constraint violation with $\mathcal{O}(\epsilon^{-3})$ sample complexity. This improves the state-of-the-art sample complexity for CMDPs by a factor of $\mathcal{O}(\epsilon^{-1})$.