Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor-critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
