Safety is essential for deploying Deep Reinforcement Learning (DRL) algorithms in real-world scenarios. Recently, verification approaches have been proposed to quantify the number of violations of a DRL policy over input-output relationships, called properties. However, such properties are hard-coded and require task-level knowledge, which makes them impractical to apply in challenging safety-critical tasks. To this end, we introduce the Collection and Refinement of Online Properties (CROP) framework to design properties at training time. CROP employs a cost signal to identify unsafe interactions and uses them to shape safety properties. We then propose a refinement strategy that combines properties modeling similar unsafe interactions. Our evaluation compares the benefits of computing the number of violations using standard hard-coded properties against the ones generated with CROP. We evaluate our approach in several robotic mapless navigation tasks and demonstrate that the violation metric computed with CROP yields higher returns and fewer violations than previous Safe DRL approaches.
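The collect-then-refine idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: a property is modeled as an axis-aligned box of states paired with the action that incurred a cost, "similar" is taken to mean same action and overlapping boxes, and the violation metric is estimated by uniform sampling rather than by formal verification. The names `Property`, `collect`, `refine`, `violation_rate`, and the `eps` parameter are hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(eq=False)  # eq=False: avoid ambiguous element-wise array comparison
class Property:
    """Hypothetical property: states in [low, high] should not map to `action`."""
    low: np.ndarray
    high: np.ndarray
    action: int


def collect(state, action, cost, eps=0.05):
    """Return a box of half-width `eps` around `state` if the step was costly."""
    if cost <= 0:
        return None  # safe interaction: nothing to collect
    s = np.asarray(state, dtype=float)
    return Property(s - eps, s + eps, action)


def similar(p, q):
    """Assumed similarity test: same unsafe action and overlapping boxes."""
    overlap = np.all(p.low <= q.high) and np.all(q.low <= p.high)
    return p.action == q.action and bool(overlap)


def refine(props):
    """Greedily merge similar properties into their bounding box."""
    merged = []
    for p in props:
        cur = p
        while True:
            hit = next((q for q in merged if similar(cur, q)), None)
            if hit is None:
                break
            merged = [q for q in merged if q is not hit]
            cur = Property(np.minimum(cur.low, hit.low),
                           np.maximum(cur.high, hit.high),
                           cur.action)
        merged.append(cur)
    return merged


def violation_rate(policy, props, n=1000, seed=0):
    """Sampling-based estimate of how often `policy` picks the unsafe action."""
    if not props:
        return 0.0
    rng = np.random.default_rng(seed)
    hits = 0
    for p in props:
        xs = rng.uniform(p.low, p.high, size=(n, p.low.size))
        hits += sum(policy(x) == p.action for x in xs)
    return hits / (n * len(props))
```

In a training loop, one would call `collect` on every transition, keep the non-`None` results, periodically call `refine`, and feed `violation_rate` back as a penalty or constraint signal. How CROP actually performs each of these steps is defined in the paper body, not here.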