Safe reinforcement learning (RL) trains a policy to maximize task reward while satisfying safety constraints. While prior work focuses on performance optimality, we find that the optimal solutions of many safe RL problems are neither robust nor safe against carefully designed observational perturbations. We formally analyze the unique properties of designing effective observational adversarial attackers in the safe RL setting. We show that baseline adversarial attack techniques for standard RL tasks are not always effective for safe RL, and we propose two new approaches: one maximizes the cost and the other maximizes the reward. One interesting and counter-intuitive finding is that the maximum reward attack is strong, as it can both induce unsafe behaviors and remain stealthy by maintaining the reward. We further propose a robust training framework for safe RL and evaluate it via comprehensive experiments. This paper provides pioneering work on the safety and robustness of RL under observational attacks for future safe RL studies. Code is available at: \url{https://github.com/liuzuxin/safe-rl-robustness}
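To make the idea of a bounded observational attack concrete, the following is a minimal sketch and not the implementation from the paper or its repository. It assumes the attacker may move the observation anywhere inside an l-infinity ball of radius `epsilon` around the true state, and that it scores a perturbed observation by evaluating a critic at the true state with the action the policy picks from that perturbed observation. The linear-tanh `policy`, the toy `reward_critic` and `cost_critic`, and the helper `linf_attack` are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def linf_attack(obs, objective, epsilon, steps=20, h=1e-4):
    """Sign-gradient ascent on `objective` inside an l-infinity ball.

    Gradients are estimated with central finite differences, so the
    objective only needs to be a callable mapping an observation to a float.
    """
    obs = np.asarray(obs, dtype=float)
    step_size = 2.5 * epsilon / steps  # heuristic step size (assumption)
    perturbed = obs.copy()
    for _ in range(steps):
        grad = np.zeros_like(perturbed)
        for i in range(perturbed.size):
            e = np.zeros_like(perturbed)
            e[i] = h
            grad[i] = (objective(perturbed + e) - objective(perturbed - e)) / (2 * h)
        perturbed = perturbed + step_size * np.sign(grad)
        # Project back onto the allowed perturbation set around the true state.
        perturbed = np.clip(perturbed, obs - epsilon, obs + epsilon)
    return perturbed


# --- Hypothetical toy problem (not from the paper) ---
w = np.array([1.0, -0.5])


def policy(observation):
    """Toy deterministic policy: scalar action in (-1, 1)."""
    return float(np.tanh(np.dot(w, observation)))


def reward_critic(state, action):
    """Toy reward value: larger actions earn more reward when state[0] > 0."""
    return state[0] * action


def cost_critic(state, action):
    """Toy cost value: larger action magnitude is less safe."""
    return action ** 2


obs = np.array([0.5, 0.2])  # true state, also the clean observation
epsilon = 0.1


def reward_objective(o):
    # The critic sees the true state; only the policy sees the perturbed input.
    return reward_critic(obs, policy(o))


def cost_objective(o):
    return cost_critic(obs, policy(o))


adv_mr = linf_attack(obs, reward_objective, epsilon)  # maximum reward attacker
adv_mc = linf_attack(obs, cost_objective, epsilon)    # maximum cost attacker

print("max-reward perturbation:", adv_mr - obs)
print("max-cost perturbation:  ", adv_mc - obs)
```

In this toy setup the reward and the cost both grow with the action, so the two attackers end up pushing the observation in the same direction. That is only an illustration of how a reward-seeking perturbation can also raise cost; it says nothing about the environments or results reported in the paper.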