We consider discounted infinite-horizon constrained Markov decision processes (CMDPs), where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs to online learning in safety-critical systems, we focus on developing an algorithm that ensures constraint satisfaction during learning. To this end, we develop a zeroth-order interior point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees feasibility of the policies during the learning process and converges to the optimal policy with a sample complexity of $O(\varepsilon^{-6})$. Compared with the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $O(\varepsilon^{-2})$ samples to ensure policy feasibility during learning under the same Fisher non-degenerate parameterization.
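As a rough illustration of the log barrier idea, the problem can be sketched as follows. The notation is an assumption made here for illustration and does not come from the abstract: $\theta$ denotes the policy parameter, $V_0(\theta)$ the expected cumulative discounted reward, $V_i(\theta) \ge 0$ for $i = 1, \dots, m$ the expected cumulative constraints, and $\eta > 0$ a barrier weight.

```latex
% Constrained problem (assumed notation):
\max_{\theta} \; V_0(\theta)
\quad \text{subject to} \quad V_i(\theta) \ge 0, \qquad i = 1, \dots, m.

% Log barrier surrogate, defined only on the strictly feasible set
% \{\theta : V_i(\theta) > 0 \ \text{for all } i\}:
B_\eta(\theta) \;=\; V_0(\theta) \;+\; \eta \sum_{i=1}^{m} \log V_i(\theta).
```

Because $B_\eta(\theta) \to -\infty$ as any $V_i(\theta) \to 0^{+}$, an ascent method on $B_\eta$ with suitably small steps can keep every iterate strictly feasible, which is the sense in which an interior point approach can maintain feasibility during learning rather than only at convergence.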