While reinforcement learning produces very promising results for many applications, its main disadvantage is the lack of safety guarantees, which prevents its use in safety-critical systems. In this work, we address this issue with a safety shield for nonlinear continuous systems that solve reach-avoid tasks. Our safety shield prevents the application of potentially unsafe actions proposed by a reinforcement learning agent by projecting each proposed action onto the closest safe action. This approach is called action projection and is implemented via mixed-integer optimization. The safety constraints for action projection are obtained by applying parameterized reachability analysis using polynomial zonotopes, which makes it possible to accurately capture the nonlinear effects of the actions on the system. In contrast to other state-of-the-art approaches for action projection, our safety shield efficiently handles input constraints and dynamic obstacles, eases the incorporation of the robot's spatial dimensions into the safety constraints, guarantees robust safety despite process noise and measurement errors, and is well suited for high-dimensional systems, as we demonstrate on several challenging benchmark systems.
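To make the idea of action projection concrete, the following is a minimal sketch under strong simplifying assumptions, not the method of this work. It replaces the mixed-integer program with a brute-force grid search over the input bounds, and it replaces reachability analysis with a point simulation of hypothetical unicycle-like dynamics, so it carries no formal safety guarantee. The names `step`, `is_safe`, and `project_action`, the circular obstacle, and all numeric values are illustrative assumptions.

```python
import math
from itertools import product


def step(state, action, dt=0.1):
    """One Euler step of a toy unicycle-like model (hypothetical dynamics)."""
    x, y, theta = state
    v, omega = action
    return (x + dt * v * math.cos(theta),
            y + dt * v * math.sin(theta),
            theta + dt * omega)


def is_safe(state, action, obstacle, horizon=10):
    """Hold the action constant for `horizon` steps and check a circular obstacle.

    This checks a single simulated trajectory only; it is NOT a reachable-set
    computation and ignores process noise and measurement errors.
    """
    cx, cy, radius = obstacle
    for _ in range(horizon):
        state = step(state, action)
        if math.hypot(state[0] - cx, state[1] - cy) <= radius:
            return False
    return True


def project_action(state, proposed, bounds, obstacle, n_grid=21):
    """Return the proposed action if it is admissible and safe; otherwise the
    closest safe action on a grid over the input bounds, or None if the grid
    contains no safe action."""
    in_bounds = all(lo <= a <= hi for a, (lo, hi) in zip(proposed, bounds))
    if in_bounds and is_safe(state, proposed, obstacle):
        return tuple(proposed)
    # Candidate actions: a regular grid that respects the input constraints.
    axes = [[lo + i * (hi - lo) / (n_grid - 1) for i in range(n_grid)]
            for lo, hi in bounds]
    safe = [a for a in product(*axes) if is_safe(state, a, obstacle)]
    if not safe:
        return None
    # Projection step: minimize the Euclidean distance to the proposed action.
    return min(safe, key=lambda a: math.dist(a, proposed))
```

For example, with the state `(0.0, 0.0, 0.0)`, an obstacle `(1.0, 0.0, 0.3)` directly ahead, and input bounds `[(0.0, 1.0), (-1.0, 1.0)]`, a proposed action that drives straight ahead at full speed is rejected and replaced by the nearest grid action whose simulated trajectory avoids the obstacle. A proposed action that is already safe is returned unchanged, which is the defining property of a projection-based shield.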