A core challenge impeding the use of Reinforcement Learning (RL) in real-world applications is the inability to naturally enforce safety while incurring only a limited number of failures. One notion of safety of vast practical relevance is the ability to avoid unsafe regions of the state space. Though such a safety goal can be captured by action-value-like functions, a.k.a. safety critics, the associated operator lacks the contraction and uniqueness properties that the classical Bellman operator enjoys. In this work, we address the non-contractiveness of safety critic operators by leveraging the fact that safety is a binary property. To that end, we study the binary safety critic associated with a deterministic dynamical system that seeks to avoid reaching an unsafe region. We formulate the corresponding binary Bellman equation (B2E) for safety and study its properties. While the resulting operator is still non-contractive, we fully characterize its fixed points, which, except for one spurious solution, represent maximal persistently safe regions of the state space from which failure can always be avoided. We provide an algorithm that, by design, leverages axiomatic knowledge of safe data to avoid spurious fixed points.
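To make the fixed-point structure concrete, the following is a minimal sketch on a hypothetical six-state toy system. The abstract does not state the B2E explicitly, so the operator used here, (Tb)(s,a) = i(s') + (1 - i(s')) min_a' b(s',a'), with s' the deterministic successor of (s,a), i the indicator of the unsafe set, and b = 1 meaning "unsafe", is an assumed form chosen for illustration. The dynamics, state labels, and function names are likewise illustrative assumptions, and the brute-force enumeration is not the algorithm proposed in this work.

```python
from itertools import product

# Hypothetical toy system (an assumption, not taken from the paper):
# states 0..5 on a line, state 5 is the unsafe region.
STATES = range(6)
ACTIONS = (-1, +1)
UNSAFE = {5}


def step(s, a):
    """Deterministic dynamics: from state 4 every action leads to 5; 5 is absorbing."""
    if s >= 4:
        return 5
    return max(s + a, 0)


def b2e_operator(b):
    """Assumed binary Bellman operator:
    (Tb)(s, a) = i(s') + (1 - i(s')) * min_a' b(s', a'),
    where i(s') is 1 if s' is unsafe and 0 otherwise.
    """
    new_b = {}
    for s, a in product(STATES, ACTIONS):
        s_next = step(s, a)
        unsafe = 1 if s_next in UNSAFE else 0
        best_next = min(b[(s_next, a2)] for a2 in ACTIONS)
        new_b[(s, a)] = unsafe + (1 - unsafe) * best_next
    return new_b


def fixed_points():
    """Enumerate every binary function b and keep those satisfying Tb = b."""
    keys = list(product(STATES, ACTIONS))
    found = []
    for values in product((0, 1), repeat=len(keys)):
        b = dict(zip(keys, values))
        if b2e_operator(b) == b:
            found.append(b)
    return found


solutions = fixed_points()

# The all-ones function declares every state-action pair unsafe.
spurious = {key: 1 for key in product(STATES, ACTIONS)}

# For each fixed point, the states with at least one action labeled safe.
safe_regions = [
    sorted(s for s in STATES if min(b[(s, a)] for a in ACTIONS) == 0)
    for b in solutions
]
print(len(solutions), safe_regions)
```

Under these assumptions, the enumeration finds two fixed points: the all-ones function, which labels everything unsafe, and one whose safe region is {0, 1, 2, 3}. State 4 is not itself in the unsafe set, yet it is excluded because every action from it leads to state 5. Since the operator in this toy example admits both solutions, iterating it cannot by itself single out the meaningful one, which is where prior knowledge of safe data becomes useful.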