Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations: a constraint violation in one round can be compensated by strict constraint satisfaction in another. This makes the online learning process unsafe, since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.
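To make the notion of error cancellation concrete, the following is a minimal sketch with made-up numbers, not the algorithm or the formal regret definition of this paper. It assumes a single constraint and a hypothetical list `gaps`, where `gaps[k]` is the constraint gap of the policy played in episode k (positive means the constraint is violated, negative means it is strictly satisfied). The function names `weak_violation` and `strong_violation` are illustrative only.

```python
def weak_violation(gaps):
    """Cumulative violation WITH error cancellations (assumed form).

    Signed gaps are summed first, so strictly safe episodes (negative gaps)
    offset unsafe ones; only the positive part of the total is counted.
    """
    return max(sum(gaps), 0.0)


def strong_violation(gaps):
    """Cumulative violation WITHOUT error cancellations (assumed form).

    Each episode's gap is clipped at zero before summing, so an unsafe
    episode can never be compensated by a safe one.
    """
    return sum(max(g, 0.0) for g in gaps)


# Hypothetical learner that oscillates: it violates the constraint by 0.5
# in even episodes and over-satisfies it by 0.5 in odd episodes.
gaps = [0.5 if k % 2 == 0 else -0.5 for k in range(100)]

weak = weak_violation(gaps)      # violations cancel out in the signed sum
strong = strong_violation(gaps)  # grows linearly with the number of episodes

print(f"with cancellations:    {weak}")
print(f"without cancellations: {strong}")
```

In this toy sequence the learner is unsafe in every second episode, yet the measure that allows cancellations reports no violation at all, while the measure without cancellations grows linearly in the number of episodes. A sublinear bound on the latter is the stronger guarantee.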