CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions

Source: arXiv; to appear at ICRA 2026. Sample code for the navigation example, including the core CBF-RL reward construction, is available at https://github.com/lzyang2000/cbf-rl-navigation-demo

Reinforcement learning (RL), while powerful and expressive, can often prioritize performance at the expense of safety. Yet safety violations can lead to catastrophic outcomes in real-world deployments. Control Barrier Functions (CBFs) offer a principled method to enforce dynamic safety, traditionally deployed online via safety filters. While the result is safe behavior, the fact that the RL policy does not have knowledge of the CBF can lead to conservative behaviors. This paper proposes CBF-RL, a framework for generating safe behaviors with RL by enforcing CBFs in training. CBF-RL has two key attributes: (1) minimally modifying a nominal RL policy to encode safety constraints via a CBF term, and (2) safety filtering of the policy rollouts in training. Theoretically, we prove that continuous-time safety filters can be deployed via closed-form expressions on discrete-time rollouts. Practically, we demonstrate that CBF-RL internalizes the safety constraints in the learned policy, both enforcing safer actions and biasing towards safer rewards, which enables safe deployment without the need for an online safety filter. We validate our framework through ablation studies on navigation tasks and on the Unitree G1 humanoid robot, where CBF-RL enables safer exploration, faster convergence, and robust performance under uncertainty, allowing the humanoid robot to avoid obstacles and climb stairs safely in real-world settings without a runtime safety filter.
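
To make the two ingredients in the abstract concrete, here is a minimal sketch of a closed-form CBF safety filter and a CBF-based reward penalty for a toy navigation problem. Everything specific in it is an assumption for illustration and is not taken from the paper or the linked repository: the single-integrator dynamics (x_dot = u), the circular-obstacle barrier h(x) = ||x - c||^2 - r^2, the function names `cbf_filter` and `shaped_reward`, and the parameters `alpha` and `weight`. The filter solves min ||u - u_nom||^2 subject to grad_h(x) . u + alpha * h(x) >= 0, which has a closed-form solution because there is a single affine constraint.

```python
import numpy as np


def cbf_filter(x, u_nom, center, radius, alpha=1.0):
    """Closed-form minimal modification of u_nom for one CBF constraint.

    Assumes single-integrator dynamics x_dot = u and the barrier
    h(x) = ||x - center||^2 - radius^2 (hypothetical toy setup).
    Returns the filtered action and the size of the nominal violation.
    """
    x = np.asarray(x, dtype=float)
    u_nom = np.asarray(u_nom, dtype=float)
    center = np.asarray(center, dtype=float)

    h = np.dot(x - center, x - center) - radius**2
    grad_h = 2.0 * (x - center)

    # CBF condition for the nominal action: residual >= 0 means it is safe.
    residual = grad_h @ u_nom + alpha * h
    violation = max(0.0, -residual)

    denom = grad_h @ grad_h
    if violation == 0.0 or denom == 0.0:
        return u_nom.copy(), violation

    # Project u_nom onto the half-space {u : grad_h @ u + alpha * h >= 0}.
    u_safe = u_nom + (violation / denom) * grad_h
    return u_safe, violation


def shaped_reward(r_nominal, violation, weight=1.0):
    """Nominal reward minus a penalty on the CBF violation (assumed form)."""
    return r_nominal - weight * violation


# Robot at (-2, 0) driving straight at a unit-radius obstacle at the origin.
u_safe, violation = cbf_filter([-2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 1.0)
# Here u_safe is [0.75, 0.0] and violation is 1.0.
reward = shaped_reward(0.0, violation)
```

In a training loop under these assumptions, the environment would be stepped with `u_safe` rather than the raw policy action, while the penalty from `shaped_reward` tells the policy how far its own action was from satisfying the constraint. For this particular convex barrier, an Euler step x + dt * u_safe keeps h nonnegative whenever alpha * dt <= 1, which is a property of the toy example and not a restatement of the paper's theorem.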
