We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret either incur growing strong constraint violation or, owing to their inherent oscillations, are limited to average-iterate convergence. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy: the safety margin is scheduled so that it asymptotically dominates the decay rates of the optimization and statistical errors, which clamps the cumulative violation at a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
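As a purely illustrative sketch (the symbols $V_r^{\pi}$, $V_c^{\pi}$, $b$, $\mathcal{H}$, $\epsilon_k$, and $\tau_k$ are our notation, not the paper's), the margin-regularized primal-dual idea behind FlexDOME can be pictured as a per-episode saddle-point problem in which the cost budget $b$ is tightened by a vanishing safety margin $\epsilon_k$ and both players are regularized with a vanishing weight $\tau_k$:
\[
\max_{\pi}\ \min_{\lambda \ge 0}\ L_k(\pi,\lambda), \qquad L_k(\pi,\lambda) \;=\; V_r^{\pi} \;-\; \lambda\bigl(V_c^{\pi} - (b - \epsilon_k)\bigr) \;+\; \tau_k\,\mathcal{H}(\pi) \;+\; \tfrac{\tau_k}{2}\,\lambda^{2},
\]
with $\epsilon_k \downarrow 0$ and $\tau_k \downarrow 0$. Heuristically, scheduling $\epsilon_k$ to decay more slowly than the combined optimization and statistical error makes the per-episode violation non-positive up to a summable remainder, which is what a $\tilde{O}(1)$ cumulative bound requires, while the $\tau_k$-regularization damps the primal-dual oscillations that otherwise preclude last-iterate convergence.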