Ensuring safety in reinforcement learning (RL), typically framed as a Constrained Markov Decision Process (CMDP), is crucial for real-world applications that involve exploration. Current approaches to CMDPs struggle to balance optimality and feasibility: direct optimization methods cannot ensure state-wise safety during training, and projection-based methods correct actions inefficiently through lengthy iterations. To address these challenges, we propose Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm that uses the safety recovery rate as a surrogate chance constraint to iteratively ensure safety both during exploration and after convergence. Theoretical analysis indicates that the relaxed probabilistic constraint is sufficient to guarantee forward invariance of the safe set. Extensive experiments on both simulated and real-world safety-critical tasks demonstrate its effectiveness in enforcing safety (nearly zero violations) while preserving optimality (+23.8%), robustness, and fast response in stochastic real-world settings.