We consider discounted infinite-horizon constrained Markov decision processes (CMDPs), where the goal is to find an optimal policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Motivated by the application of CMDPs to online learning in safety-critical systems, we focus on developing a model-free and simulator-free algorithm that ensures constraint satisfaction during learning. To this end, we develop an interior-point approach based on the log barrier function of the CMDP. Under the commonly assumed conditions of Fisher non-degeneracy and bounded transfer error of the policy parameterization, we establish the theoretical properties of the algorithm. In particular, in contrast to existing CMDP approaches that ensure policy feasibility only upon convergence, our algorithm guarantees the feasibility of the policies throughout the learning process and converges to an $\varepsilon$-optimal policy with a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-6})$. Compared with the state-of-the-art policy gradient-based algorithm, C-NPG-PDA, our algorithm requires an additional $\mathcal{O}(\varepsilon^{-2})$ samples to ensure policy feasibility during learning under the same Fisher non-degenerate parameterization.
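
As a rough sketch of the log barrier idea, in notation that is assumed here for illustration and is not given in the abstract: let $\pi_\theta$ be a parameterized policy, $V_0(\theta)$ its expected cumulative discounted reward, and $V_i(\theta) \ge 0$, $i = 1, \dots, m$, the expected cumulative constraints written in a sign convention where nonnegative means feasible. A generic log barrier surrogate with barrier parameter $\eta > 0$ (also an assumed symbol) and its gradient would then read:

$$
\max_{\theta} \; V_0(\theta) \quad \text{s.t.} \quad V_i(\theta) \ge 0, \quad i = 1, \dots, m,
$$

$$
B_\eta(\theta) = V_0(\theta) + \eta \sum_{i=1}^{m} \log V_i(\theta),
\qquad
\nabla_\theta B_\eta(\theta) = \nabla_\theta V_0(\theta) + \eta \sum_{i=1}^{m} \frac{\nabla_\theta V_i(\theta)}{V_i(\theta)}.
$$

Because $\log V_i(\theta) \to -\infty$ as $V_i(\theta) \to 0^{+}$, ascent steps on $B_\eta$ that are small enough keep the iterates strictly inside the feasible region, which is the usual intuition for how an interior-point scheme can maintain feasibility at every iterate rather than only at convergence. The exact surrogate, gradient estimator, and step-size rule used by the algorithm may differ from this sketch.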