Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can degrade task performance or violate safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or focus only on robustness against performance adversaries (e.g., robust RL). Learning a single policy that is both safe and robust remains a challenging open problem. The difficulty lies in handling two intertwined aspects in the worst case: feasibility and optimality. Optimality is only valid inside a feasible region, while identifying the maximal feasible region must rely on learning the optimal policy. To address this issue, we propose a systematic framework that unifies safe RL and robust RL, covering problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games. We propose a dual policy iteration scheme that simultaneously optimizes a task policy and a safety policy, and we prove its convergence. Furthermore, we design a deep RL algorithm for practical implementation, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), significantly outperforming all baselines.
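To make the notion of worst-case feasibility concrete, the following minimal sketch computes the maximal feasible region of a toy tabular game. It is not DRAC and not the dual policy iteration scheme described above; the chain environment, the disturbance sets, and every name (`adversary_actions`, `step`, `robust_feasible_set`) are assumptions chosen purely for illustration. A state counts as feasible if the agent has some action that keeps the next state feasible for every disturbance the adversary may choose.

```python
# Toy illustration (hypothetical environment, not from the paper):
# a 1-D chain of states 0..6 where state 0 violates the safety constraint.
N_STATES = 7
UNSAFE = {0}
AGENT_ACTIONS = (0, 1)  # stay, or move one step away from the unsafe state


def adversary_actions(state):
    """Assumed disturbance set: the adversary pushes harder near the unsafe state."""
    return (0, -1, -2) if state <= 2 else (0, -1)


def step(state, a, d):
    """Deterministic transition, clipped to the ends of the chain."""
    return min(max(state + a + d, 0), N_STATES - 1)


def robust_feasible_set(adversarial=True):
    """Largest set of states from which the agent can stay safe forever.

    Starts from all constraint-satisfying states and repeatedly removes
    states where every agent action can be driven out of the set by some
    disturbance, until nothing changes.
    """
    feasible = set(range(N_STATES)) - UNSAFE
    while True:
        kept = set()
        for s in feasible:
            disturbances = adversary_actions(s) if adversarial else (0,)
            # Feasible if SOME agent action is safe against ALL disturbances.
            if any(
                all(step(s, a, d) in feasible for d in disturbances)
                for a in AGENT_ACTIONS
            ):
                kept.add(s)
        if kept == feasible:
            return feasible
        feasible = kept


if __name__ == "__main__":
    print("no adversary:  ", sorted(robust_feasible_set(adversarial=False)))
    print("with adversary:", sorted(robust_feasible_set(adversarial=True)))
```

In this toy setting, states 1 and 2 satisfy the constraint at the current step yet fall outside the feasible region once the adversary is taken into account, which is why any notion of optimality has to be restricted to the worst-case feasible region rather than to the set of currently safe states.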