Constrained Markov Decision Processes (CMDPs) are a common way to model safe reinforcement learning problems, where constraint functions model the safety objectives. Lagrangian-based dual or primal-dual algorithms provide efficient methods for learning in CMDPs. For these algorithms, the currently known regret bounds in the finite-horizon setting allow for a "cancellation of errors": a constraint violation in one episode can be compensated by strict constraint satisfaction in another. However, we do not consider such behavior safe in practical applications. In this paper, we overcome this weakness by proposing a novel model-based dual algorithm, OptAug-CMDP, for tabular finite-horizon CMDPs. Our algorithm is motivated by the augmented Lagrangian method and can be implemented efficiently. We show that during $K$ episodes of exploring the CMDP, our algorithm obtains a regret of $\tilde{O}(\sqrt{K})$ for both the objective and the constraint violation. Unlike existing Lagrangian approaches, our algorithm achieves this regret without the need for the cancellation of errors.
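To make the "cancellation of errors" concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a single constraint of the form "expected episode cost at most a threshold"; the function name `cumulative_violation`, the cost values, and the threshold are all hypothetical and chosen only for illustration. It contrasts a signed sum of per-episode violations, where over- and under-shooting offset each other, with a sum of positive parts, where they do not.

```python
def cumulative_violation(costs, threshold, allow_cancellation):
    """Sum per-episode constraint violations over all episodes.

    costs: expected constraint cost of the policy played in each episode
           (hypothetical values, assumed to be known here).
    threshold: the assumed upper limit on the expected cost.
    allow_cancellation: if True, negative gaps (strict satisfaction)
           offset positive gaps (violations); if False, only the
           positive part of each gap is counted.
    """
    total = 0.0
    for cost in costs:
        gap = cost - threshold
        total += gap if allow_cancellation else max(gap, 0.0)
    return total


# A policy sequence that alternates between violating and over-satisfying
# the constraint by the same margin.
costs = [1.5, 0.5, 1.5, 0.5]
threshold = 1.0

signed = cumulative_violation(costs, threshold, allow_cancellation=True)
clipped = cumulative_violation(costs, threshold, allow_cancellation=False)

print(signed)   # violations and slack cancel out
print(clipped)  # every violating episode is counted
```

In this toy sequence the signed total is zero even though half of the episodes violate the constraint, while the clipped total grows with each violating episode. A bound on the clipped quantity is therefore the stronger guarantee.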