Partially observable Markov decision processes (POMDPs) are widely used in robotics for sequential decision-making under uncertainty. Online POMDP planning algorithms such as Partially Observable Monte-Carlo Planning (POMCP) can solve very large POMDPs with the goal of maximizing the expected return. However, the resulting policies cannot provide safety guarantees, which are imperative for real-world safety-critical tasks (e.g., autonomous driving). In this work, we consider safety requirements represented as almost-sure reach-avoid specifications (i.e., the probability of reaching a set of goal states is one and the probability of reaching a set of unsafe states is zero). We compute shields that block unsafe actions, i.e., actions that would violate the almost-sure reach-avoid specifications. We then integrate these shields into the POMCP algorithm for safe online POMDP planning. We propose four distinct shielding methods, which differ in how the shields are computed and integrated, including factored variants designed to improve scalability. Experimental results on a set of benchmark domains demonstrate that the proposed shielding methods successfully guarantee safety on large POMDPs (unlike the baseline POMCP without shielding), with negligible impact on the runtime of online planning.
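To make the idea of a shield concrete, the following is a minimal illustrative sketch of one way a precomputed shield could restrict action selection during tree search. It is not one of the four methods proposed in this work. It assumes, purely for illustration, that the shield is a dictionary mapping a belief support (the set of states with nonzero probability) to the set of allowed actions, and that action selection uses the standard UCB1 rule. The names `shielded_ucb_action`, `shield`, `q_values`, and `visit_counts` are hypothetical.

```python
import math


def shielded_ucb_action(support, q_values, visit_counts, shield, c=1.0):
    """Pick the UCB1-maximizing action among those the shield allows.

    support      -- iterable of states with nonzero belief (assumed representation)
    q_values     -- dict: action -> current value estimate
    visit_counts -- dict: action -> number of times the action was tried
    shield       -- dict: frozenset(support) -> set of allowed actions
                    (hypothetical precomputed shield)
    """
    allowed = shield.get(frozenset(support), set())
    if not allowed:
        raise ValueError("shield allows no action for this belief support")

    # Only visits to allowed actions count towards the exploration term.
    total = sum(visit_counts.get(a, 0) for a in allowed)

    def score(action):
        n = visit_counts.get(action, 0)
        if n == 0:
            return math.inf  # try unvisited allowed actions first
        return q_values.get(action, 0.0) + c * math.sqrt(math.log(total) / n)

    # Sorting makes tie-breaking deterministic.
    return max(sorted(allowed), key=score)


# Usage: "right" has the highest value estimate but is blocked by the shield.
shield = {frozenset({"s0", "s1"}): {"left", "wait"}}
q_values = {"left": 1.0, "right": 5.0, "wait": 0.5}
visit_counts = {"left": 10, "right": 10, "wait": 10}
chosen = shielded_ucb_action({"s0", "s1"}, q_values, visit_counts, shield)
```

In this sketch the planner still maximizes the estimated return, but only over actions the shield permits, so an action with a high value estimate is never selected if the shield blocks it for the current belief support.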