Constrained Markov Decision Processes (CMDPs) are a common way to model safe reinforcement learning problems, with the safety objectives expressed as constraint functions. Lagrangian-based dual or primal-dual algorithms provide efficient methods for learning in CMDPs. For these algorithms, the currently known regret bounds in the finite-horizon setting allow for a \textit{cancellation of errors}; that is, a constraint violation in one episode can be compensated for by strict constraint satisfaction in another episode. In practical applications, however, such behavior is not considered safe. In this paper, we overcome this weakness by proposing a novel model-based dual algorithm, \textsc{OptAug-CMDP}, for tabular finite-horizon CMDPs. Our algorithm is motivated by the augmented Lagrangian method and can be executed efficiently. We show that during $K$ episodes of exploring the CMDP, our algorithm achieves a regret of $\tilde{O}(\sqrt{K})$ for both the objective and the constraint violation. Unlike existing Lagrangian approaches, our algorithm achieves this regret without the need for the cancellation of errors.
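To make the cancellation of errors concrete, the following minimal Python sketch contrasts two ways of accumulating constraint violation over episodes. It is an illustration under stated assumptions, not the paper's formal definition: the function names `weak_violation` and `strong_violation` are hypothetical, and each entry of `gaps` is assumed to be a per-episode constraint gap in which a positive value means the constraint was violated and a negative value means it was strictly satisfied.

```python
def weak_violation(gaps):
    """Cumulative violation WITH cancellation: sum first, then clip at zero.

    Negative gaps (strict satisfaction) can offset positive gaps (violations).
    """
    return max(0.0, sum(gaps))


def strong_violation(gaps):
    """Cumulative violation WITHOUT cancellation: clip each episode, then sum.

    Every violating episode is counted, regardless of other episodes.
    """
    return sum(max(0.0, g) for g in gaps)


# Hypothetical per-episode gaps (assumed sign convention: positive = violated).
gaps = [0.5, -0.5, 0.5, -0.5]

print(weak_violation(gaps))    # 0.0: violations are masked by satisfied episodes
print(strong_violation(gaps))  # 1.0: the two violating episodes are both counted
```

In this toy sequence the policy violates the constraint in every other episode, yet the measure that permits cancellation reports no violation at all. A bound on the second quantity rules out this kind of oscillation.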