Implicit Safe Set Algorithm for Provably Safe Reinforcement Learning

Deep reinforcement learning (DRL) has demonstrated remarkable performance in many continuous control tasks. However, a significant obstacle to the real-world application of DRL is the lack of safety guarantees. Although DRL agents can satisfy system safety in expectation through reward shaping, designing agents to consistently meet hard constraints (e.g., safety specifications) at every time step remains a formidable challenge. In contrast, existing work in the field of safe control provides guarantees on persistent satisfaction of hard safety constraints. However, these methods require explicit analytical system dynamics models to synthesize safe control, which are typically inaccessible in DRL settings. In this paper, we present a model-free safe control algorithm, the implicit safe set algorithm, for synthesizing safeguards for DRL agents that ensure provable safety throughout training. The proposed algorithm synthesizes a safety index (barrier certificate) and a subsequent safe control law solely by querying a black-box dynamic function (e.g., a digital twin simulator). Moreover, we theoretically prove that the implicit safe set algorithm guarantees finite time convergence to the safe set and forward invariance for both continuous-time and discrete-time systems. We validate the proposed algorithm on the state-of-the-art Safety Gym benchmark, where it achieves zero safety violations while gaining $95\% \pm 9\%$ cumulative reward compared to state-of-the-art safe DRL methods. Furthermore, the resulting algorithm scales well to high-dimensional systems with parallel computing.
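To make the idea of synthesizing safe control purely from black-box queries concrete, the following is a minimal sketch and not the paper's implementation. It assumes a discrete-time system, a safety index `phi` that is positive exactly when the state is unsafe, and a decrease condition of the form `phi(x') <= max(phi(x) - eta, 0)`. The function `safeguard`, the toy dynamics `toy_step`, the index `toy_phi`, and all parameter values are hypothetical names introduced only for illustration. Uniform random sampling of candidate actions stands in for whatever action search the actual algorithm uses.

```python
import numpy as np


def safeguard(x, u_ref, step, phi, eta=0.05, n_samples=256, u_max=1.0, rng=None):
    """Project a reference action onto a sampled set of safe actions.

    `step(x, u)` is treated as a black box (e.g., a simulator query);
    no analytical dynamics model is used anywhere in this function.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Assumed discrete-time safety condition on the next state.
    bound = max(phi(x) - eta, 0.0)

    # Leave the agent's action untouched if it is already safe.
    if phi(step(x, u_ref)) <= bound:
        return u_ref

    # Query the black box on uniformly sampled candidate actions.
    candidates = rng.uniform(-u_max, u_max, size=(n_samples, u_ref.shape[0]))
    safe = [u for u in candidates if phi(step(x, u)) <= bound]

    if not safe:
        # Fallback (assumption): return the least unsafe candidate.
        return min(candidates, key=lambda u: phi(step(x, u)))

    # Among safe candidates, stay as close as possible to the reference.
    return min(safe, key=lambda u: np.linalg.norm(u - u_ref))


# Toy 1-D single integrator with the constraint x <= 1 (hypothetical).
DT = 0.1


def toy_step(x, u):
    return x + DT * u


def toy_phi(x):
    return float(x[0] - 1.0)


x0 = np.array([0.95])
u_safe = safeguard(x0, np.array([1.0]), toy_step, toy_phi)
```

In this toy setting the reference action `1.0` would push the state past the constraint, so the sketch returns the sampled action nearest to it that keeps `toy_phi` non-positive at the next step. Because each candidate is evaluated independently, the black-box queries could be batched or run in parallel.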
