Shielding is widely used to enforce safety in reinforcement learning (RL), ensuring that an agent's actions remain compliant with formal specifications. Classical shielding approaches, however, are static: they assume fixed logical specifications and hand-crafted abstractions. While such static shields provide safety under nominal assumptions, they fail to adapt when environment assumptions are violated. In this paper, we develop an adaptive shielding framework based on Generalized Reactivity of rank 1 (GR(1)) specifications, a tractable yet expressive fragment of Linear Temporal Logic (LTL) that captures both safety and liveness properties. Our method detects environment assumption violations at runtime and employs Inductive Logic Programming (ILP) to repair GR(1) specifications online in a systematic and interpretable way. The shield thus evolves gracefully: it keeps liveness achievable and weakens goals only when necessary, and then minimally. On two case studies, Minepump and Atari Seaquest, we show that (i) static symbolic controllers are often severely suboptimal when optimizing for auxiliary rewards, and (ii) RL agents equipped with our adaptive shield maintain near-optimal reward and perfect logical compliance, in contrast to agents constrained by static shields.
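For orientation, a GR(1) specification is an assume-guarantee implication of the following standard form (a sketch using notation common in the GR(1) literature, not necessarily this paper's own symbols):

\[
\bigl(\theta_e \wedge \mathbf{G}\,\rho_e \wedge \textstyle\bigwedge_{i=1}^{m}\mathbf{G}\mathbf{F}\,J^e_i\bigr) \;\rightarrow\; \bigl(\theta_s \wedge \mathbf{G}\,\rho_s \wedge \textstyle\bigwedge_{j=1}^{n}\mathbf{G}\mathbf{F}\,J^s_j\bigr)
\]

Here \(\theta_e, \theta_s\) are initial conditions, \(\rho_e, \rho_s\) are safety (transition) constraints on the environment and system respectively, and the \(\mathbf{G}\mathbf{F}\,J\) conjuncts are liveness (justice) goals. A runtime trace that falsifies the left-hand assumptions is what triggers the ILP-based repair, which updates the specification, weakening the system goals on the right-hand side only when necessary, so that the shield remains realizable.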