Near-Optimal Sample Complexity for Online Constrained MDPs

Safety is a fundamental challenge in reinforcement learning (RL), particularly in real-world applications such as autonomous driving, robotics, and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to enforce safety constraints while optimizing performance. However, existing methods often incur significant safety violations or require high sample complexity to produce near-optimal policies. We address two settings: relaxed feasibility, where small violations are allowed, and strict feasibility, where no violation is allowed. We propose a model-based primal-dual algorithm that balances regret against bounded constraint violation, drawing on techniques from online RL and constrained optimization. For relaxed feasibility, we prove that, with arbitrarily high probability, our algorithm returns an $\varepsilon$-optimal policy with $\varepsilon$-bounded violation after $\tilde{O}\left(\frac{SAH^3}{\varepsilon^2}\right)$ learning episodes, matching the lower bound for unconstrained MDPs. For strict feasibility, we prove that, with arbitrarily high probability, our algorithm returns an $\varepsilon$-optimal policy with zero violation after $\tilde{O}\left(\frac{SAH^5}{\varepsilon^2\zeta^2}\right)$ learning episodes, where $\zeta$ is the problem-dependent Slater constant characterizing the size of the feasible region. This result matches the lower bound for learning CMDPs with access to a generative model. Our results demonstrate that learning CMDPs in an online setting is as easy as learning with a generative model, and is no more challenging than learning unconstrained MDPs when small violations are allowed.
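For concreteness, the two feasibility notions can be written out as follows. This is a sketch under assumed notation that does not appear in the abstract: $V_r^{\pi}$ and $V_c^{\pi}$ denote the expected reward and constraint values of a policy $\pi$ from the initial state distribution, $b$ is the constraint threshold, $\pi^\star$ is an optimal feasible policy, and $\hat{\pi}$ is the returned policy. The definition of $\zeta$ shown here is one common convention and may differ from the paper's.

$$
\begin{aligned}
\text{CMDP objective:} \quad & \max_{\pi} \; V_r^{\pi} \quad \text{s.t.} \quad V_c^{\pi} \ge b, \\
\text{Relaxed feasibility:} \quad & V_r^{\hat{\pi}} \ge V_r^{\pi^\star} - \varepsilon, \qquad V_c^{\hat{\pi}} \ge b - \varepsilon, \\
\text{Strict feasibility:} \quad & V_r^{\hat{\pi}} \ge V_r^{\pi^\star} - \varepsilon, \qquad V_c^{\hat{\pi}} \ge b, \\
\text{Slater constant (assumed form):} \quad & \zeta = \max_{\pi} V_c^{\pi} - b > 0 .
\end{aligned}
$$

Dividing the two stated bounds, $\frac{SAH^5}{\varepsilon^2\zeta^2} \big/ \frac{SAH^3}{\varepsilon^2} = \frac{H^2}{\zeta^2}$, so up to logarithmic factors the price of zero violation is a factor of $(H/\zeta)^2$, which grows as the feasible region shrinks.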
