Provably Optimal Reinforcement Learning under Safety Filtering

Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. We also define an associated filtered MDP in which every action has a safe effect, because the safety filter is treated as part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is categorically safe, (ii) standard RL convergence guarantees carry over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP, when executed through the same filter, achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter available.
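
To make the filtered-MDP construction concrete, the following is a minimal sketch on a hypothetical five-state chain, not the paper's Safety Gymnasium setup. State 0 stands in for a catastrophic failure state, the filter overrides only those actions that would reach it, and tabular Q-learning runs against the filtered dynamics. The toy environment, the names `step`, `safety_filter`, and `filtered_step`, and all hyperparameters are illustrative assumptions.

```python
import random

# Hypothetical toy SC-MDP (an assumption for illustration, not from the paper):
# states 0..4 on a line, state 0 is the catastrophic failure state.
N_STATES = 5
FAILURE = 0
ACTIONS = (-1, 0, +1)  # move left, stay, move right

def step(state, action):
    """Unfiltered deterministic dynamics; reward 1 for landing next to FAILURE."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == 1 else 0.0
    return nxt, reward

def safety_filter(state, action):
    """Permissive filter for this toy: override only actions that reach FAILURE."""
    nxt, _ = step(state, action)
    return 0 if nxt == FAILURE else action  # fallback action: stay in place

def filtered_step(state, action):
    """Filtered MDP: the filter is treated as part of the environment."""
    return step(state, safety_filter(state, action))

def q_learning(episodes=300, horizon=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Plain epsilon-greedy tabular Q-learning, run on the filtered dynamics."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    violations = 0
    for _ in range(episodes):
        s = rng.randrange(1, N_STATES)  # start in any non-failure state
        for _ in range(horizon):
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: q[(s, b)])
            s2, r = filtered_step(s, a)
            violations += int(s2 == FAILURE)  # count visits to the failure state
            target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s2
    return q, violations

q, violations = q_learning()
# Greedy policy over the non-failure states, to be executed through the same filter.
greedy = {s: max(ACTIONS, key=lambda b: q[(s, b)]) for s in range(1, N_STATES)}
```

In this sketch the learner is unmodified Q-learning; safety comes entirely from `filtered_step`, so exploration never reaches the failure state. The rewarded state sits directly next to the failure state, and the filter still lets the agent reach and hold it, which is the sense in which a sufficiently permissive filter need not cost performance in this toy.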
