Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.
翻译:屏蔽是一种有效方法,可在马尔可夫决策过程(MDP)中形式化保证强化学习智能体的安全性。然而,现有屏蔽技术通常假设已知与安全相关的转移动态——这一条件在实践中极少满足。为解决此限制,我们针对鲁棒MDP(RMDP)(即具有转移概率集合的MDP)提出一种新型屏蔽框架。我们将安全性定义为:在RMDP的最坏情况转移概率下,线性时序逻辑(LTL)公式以特定阈值概率得到满足。我们证明,所提屏蔽框架对于RMDP既完备又最优:经屏蔽允许的每个策略都是安全的,反之,每个安全的RMDP策略均被屏蔽允许。我们将该方法与现有采样方法相结合,用于学习具有概率近似正确(PAC)保证的MDP转移概率。该组合能构建具有高置信度保证安全性且最小程度限制的MDP屏蔽。实验表明,基于所学RMDP的屏蔽可确保未知MDP中的安全性,同时随着样本数量增加,能够恢复较强的期望回报。