Cost functions are commonly employed in Safe Deep Reinforcement Learning (DRL). However, the cost is typically encoded as an indicator function because it is difficult to quantify the risk of policy decisions in the state space. Such an encoding requires the agent to visit numerous unsafe states in order to learn a cost-value function that drives the learning process toward safety, which increases the number of unsafe interactions and decreases sample efficiency. In this paper, we investigate an alternative approach that uses domain knowledge to quantify the risk in the proximity of such states by defining a violation metric. This metric is computed by verifying task-level properties, expressed as input-output conditions, and is used as a penalty to bias the policy away from unsafe states without learning an additional value function. We investigate the benefits of using the violation metric in standard Safe DRL benchmarks and in robotic mapless navigation tasks. The navigation experiments bridge the gap between Safe DRL and robotics, introducing a framework that allows rapid testing on real robots. Our experiments show that policies trained with the violation penalty achieve higher performance than Safe DRL baselines and significantly reduce the number of visited unsafe states.
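To make the idea of a violation penalty concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a property of the form "if the input lies in a given box, the policy must not output a given action", and it estimates the violation by uniform sampling rather than by verification. All names (`estimate_violation`, `penalized_reward`, `weight`, the toy policy) are hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Sequence, Tuple

# Per-dimension (low, high) bounds describing the input side of a property.
Box = Sequence[Tuple[float, float]]


def estimate_violation(
    policy: Callable[[Sequence[float]], int],
    input_box: Box,
    unsafe_action: int,
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of sampled inputs in `input_box` for which
    the policy selects `unsafe_action` (a value in [0, 1]).

    Assumption: sampling stands in for the verification step here.
    """
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_samples):
        state = [rng.uniform(lo, hi) for lo, hi in input_box]
        if policy(state) == unsafe_action:
            violations += 1
    return violations / n_samples


def penalized_reward(reward: float, violation: float, weight: float = 1.0) -> float:
    """Subtract the weighted violation from the task reward."""
    return reward - weight * violation


# Toy navigation-style example (hypothetical): the state is a single
# front-distance reading; action 0 moves forward, action 1 turns.
FORWARD, TURN = 0, 1


def toy_policy(state: Sequence[float]) -> int:
    return FORWARD if state[0] > 0.1 else TURN


# Property: if the front distance is in [0, 0.2], do not move forward.
violation = estimate_violation(toy_policy, [(0.0, 0.2)], FORWARD)
shaped = penalized_reward(reward=1.0, violation=violation, weight=2.0)
```

In this sketch the penalty is applied directly to the reward, so no separate cost-value function is involved; how the metric is actually computed and weighted in the paper is not specified in the abstract.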