Safe reinforcement learning (Safe RL) aims to maximize policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, leaving them vulnerable to the adversarial perturbations commonly encountered in real-world settings. Moreover, existing gradient-based adversarial attacks typically require access to the victim policy's gradients, which is rarely available in practice. To address these challenges, we propose an adversarial attack framework that reveals vulnerabilities of Safe RL policies. Using only expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without access to the victim policy's internal gradients or the ground-truth safety constraints. We further provide a theoretical analysis that establishes the attack's feasibility and derives perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.
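To make the attack pipeline concrete, the following is a minimal PyTorch sketch of how a learned surrogate policy and constraint model could enable gradient-based attack optimization against a black-box victim. The function name `pgd_observation_attack`, the PGD-style update, and all hyperparameters are illustrative assumptions for exposition, not the paper's exact method.

```python
import torch

def pgd_observation_attack(surrogate_policy, constraint_model, obs,
                           epsilon=0.05, alpha=0.01, steps=10):
    # Hypothetical PGD-style transfer attack: perturb the observation to
    # maximize the learned constraint model's predicted safety cost, using
    # gradients from the white-box surrogate instead of the victim policy.
    obs = obs.detach()
    delta = torch.zeros_like(obs, requires_grad=True)
    for _ in range(steps):
        action = surrogate_policy(obs + delta)               # differentiable surrogate of the victim
        cost = constraint_model(obs + delta, action).sum()   # learned proxy for the safety cost
        cost.backward()                                      # gradient w.r.t. the perturbation only
        with torch.no_grad():
            delta += alpha * delta.grad.sign()               # ascend the predicted cost surface
            delta.clamp_(-epsilon, epsilon)                  # stay inside the L-infinity budget
        delta.grad.zero_()
    # The perturbed observation is then fed to the black-box victim policy.
    return (obs + delta).detach()
```

Because the perturbation is optimized against the surrogate rather than the victim, this is a transfer attack: its success depends on how closely the surrogate, trained from expert demonstrations and black-box interaction, matches the victim's decision boundary.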