Reinforcement learning (RL), while powerful and expressive, often prioritizes performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety, traditionally deployed online via safety filters. While the resulting behavior is safe, the RL policy's lack of knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs during training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts during training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time rollouts. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy, both enforcing safer actions and biasing toward safer rewards, enabling safe deployment without an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, allowing the humanoid to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
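As a rough illustration of the kind of closed-form safety filter the abstract refers to, the sketch below shows the standard single-constraint CBF-QP projection for a control-affine system $\dot{x} = f(x) + g(x)u$ with barrier $h(x) \geq 0$. This is a generic minimal sketch, not the paper's exact discrete-time formulation; the names `cbf_safety_filter`, `Lf_h`, `Lg_h`, and `alpha` are hypothetical, and `Lg_h` is assumed nonzero (relative degree one).

```python
import numpy as np

def cbf_safety_filter(u_nom, h, Lf_h, Lg_h, alpha=1.0):
    """Closed-form solution of the single-constraint CBF-QP:

        min_u ||u - u_nom||^2   s.t.   Lf_h + Lg_h @ u + alpha * h >= 0.

    Minimally modifies the nominal action: it is returned unchanged when
    it already satisfies the CBF condition, and otherwise projected onto
    the boundary of the safe half-space.
    """
    # Constraint residual: negative means the nominal action is unsafe.
    psi = Lf_h + Lg_h @ u_nom + alpha * h
    if psi >= 0.0:
        return u_nom  # nominal action already satisfies the CBF condition
    # Half-space projection: shift u_nom just enough to restore psi = 0.
    return u_nom - psi * Lg_h / (Lg_h @ Lg_h)

# Hypothetical use during a training rollout: filter the policy's action
# before stepping the environment, so unsafe exploration is corrected.
# a_safe = cbf_safety_filter(policy(obs), h(x), Lf_h(x), Lg_h(x))
```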