Safe reinforcement learning (RL) that guarantees satisfaction of hard state constraints during training has recently attracted considerable attention. Safety filters, e.g., those based on control barrier functions (CBFs), offer a promising route to safe RL by modifying an RL agent's unsafe actions on the fly. Existing safety filter-based approaches typically learn the uncertain dynamics and quantify the error of the learned model. The resulting filters are conservative until enough data has been collected to learn a good model, which hinders efficient exploration. This paper presents a method for safe and efficient RL using disturbance observers (DOBs) and CBFs. Unlike most existing safe RL methods that handle hard state constraints, our method involves no model learning; instead, it uses DOBs to accurately estimate the pointwise value of the uncertainty and incorporates this estimate into a robust CBF condition to generate safe actions. The DOB-based CBF can serve as a safety filter for model-free RL algorithms, minimally modifying the agent's actions whenever necessary to ensure safety throughout the learning process. Simulation results on a unicycle and a 2D quadrotor show that the proposed method outperforms a state-of-the-art safe RL algorithm based on CBFs and Gaussian process-based model learning in terms of safety violation rate, sample efficiency, and computational efficiency.
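
As a rough illustration of the filtering idea summarized above, the following Python sketch applies a DOB-based CBF filter to a hypothetical scalar system x_dot = u + d with the safe set h(x) = x_max - x >= 0. It is not the paper's implementation. The system, the first-order observer, the gains `l_gain` and `alpha`, the disturbance bounds `d_max` and `dd_max`, and the constant aggressive action standing in for an RL policy are all assumptions made for illustration. In this scalar case the minimally modified action has a closed form, so no QP solver is needed.

```python
import math


def dob_cbf_filter(u_rl, x, d_hat, e_bar, x_max=1.0, alpha=2.0):
    """Minimally modify u_rl so that a robust CBF condition holds.

    Assumed model: x_dot = u + d, barrier h(x) = x_max - x.
    Robust condition: -(u + d_hat) - e_bar >= -alpha * h,
    where d_hat is the DOB estimate and e_bar bounds |d - d_hat|.
    """
    h = x_max - x
    u_upper = alpha * h - d_hat - e_bar  # largest action satisfying the condition
    return min(u_rl, u_upper)            # closest safe action to u_rl


def simulate(steps=5000, dt=1e-3, l_gain=20.0, d_max=0.5, dd_max=0.2):
    """Euler simulation of the filtered closed loop (illustrative only)."""
    x = 0.0
    z = -l_gain * x  # observer state, initialised so that d_hat(0) = 0
    xs, modified = [], 0
    for k in range(steps):
        t = k * dt
        d = 0.3 + 0.2 * math.sin(t)  # true disturbance, unknown to the filter
        d_hat = z + l_gain * x       # DOB estimate of d
        # Estimation error bound: decaying initial error plus a rate-dependent floor.
        e_bar = d_max * math.exp(-l_gain * t) + dd_max / l_gain
        u_rl = 1.0                   # stand-in for an RL action pushing toward the boundary
        u = dob_cbf_filter(u_rl, x, d_hat, e_bar)
        if u < u_rl:
            modified += 1
        # Observer dynamics chosen so that d_hat_dot = l_gain * (d - d_hat).
        z += dt * (-l_gain * d_hat - l_gain * u)
        x += dt * (u + d)
        xs.append(x)
    return xs, modified


if __name__ == "__main__":
    xs, modified = simulate()
    print(f"max x = {max(xs):.4f}, actions modified = {modified}")
```

The filter leaves the action untouched whenever it already satisfies the robust condition, which mirrors the "minimally modifying" behaviour described in the abstract. The error bound `e_bar` shrinks as the observer converges, so the filter becomes less conservative over time without any model learning.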