Toward Evaluating Robustness of Reinforcement Learning with Adversarial Policy

Reinforcement learning agents are susceptible to evasion attacks during deployment. In single-agent environments, these attacks can take the form of imperceptible perturbations injected into the inputs of the victim policy network. In multi-agent environments, an attacker can manipulate an adversarial opponent to influence the victim policy's observations indirectly. While adversarial policies offer a promising technique for crafting such attacks, current methods are either sample-inefficient due to poor exploration strategies or require training an extra surrogate model under the black-box assumption. To address these challenges, we propose Intrinsically Motivated Adversarial Policy (IMAP) for efficient black-box adversarial policy learning in both single- and multi-agent environments. We formulate four types of adversarial intrinsic regularizers, which maximize the adversarial state coverage, policy coverage, risk, or divergence, to discover potential vulnerabilities of the victim policy in a principled way. We also present a novel bias-reduction method that adaptively balances the extrinsic objective and the adversarial intrinsic regularizers. Our experiments validate the effectiveness of the four types of adversarial intrinsic regularizers and the bias-reduction method in enhancing black-box adversarial policy learning across a variety of environments. IMAP successfully evades two types of defense methods, adversarial training and robust regularization, decreasing the performance of the state-of-the-art robust WocaR-PPO agents by 34\%--54\% across four single-agent tasks. IMAP also achieves a state-of-the-art attack success rate of 83.91\% in the multi-agent game YouShallNotPass. Our code is available at \url{https://github.com/x-zheng16/IMAP}.
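To make the idea of combining an extrinsic attack objective with an intrinsic coverage bonus concrete, the following is a minimal sketch and not the IMAP formulation itself. It assumes a simple count-based novelty bonus over discretized states as a stand-in for a state-coverage regularizer, takes the adversary's extrinsic reward to be the negated victim reward, and uses a fixed weight `tau` instead of the adaptive balancing described above. The names `CountCoverageBonus`, `adversary_reward`, `bin_width`, and `tau` are hypothetical.

```python
import math
from collections import Counter


class CountCoverageBonus:
    """Count-based novelty bonus over discretized states.

    Illustrative stand-in for a state-coverage regularizer; this is an
    assumption of the sketch, not the regularizer defined in the paper.
    """

    def __init__(self, bin_width=0.5):
        self.bin_width = bin_width
        self.counts = Counter()

    def _key(self, state):
        # Map a continuous state vector to a grid cell.
        return tuple(math.floor(x / self.bin_width) for x in state)

    def __call__(self, state):
        # Record the visit, then return a bonus that decays with visit count.
        key = self._key(state)
        self.counts[key] += 1
        return 1.0 / math.sqrt(self.counts[key])


def adversary_reward(victim_reward, state, bonus, tau=0.1):
    """Combine the extrinsic attack objective with the intrinsic bonus.

    The extrinsic term is the negated victim reward (assumption); tau is a
    fixed trade-off weight here, whereas the paper balances the two terms
    adaptively.
    """
    extrinsic = -victim_reward
    return extrinsic + tau * bonus(state)
```

In this sketch, rarely visited grid cells yield a larger bonus, so the adversary is nudged toward states it has not yet explored while still being rewarded for lowering the victim's return.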
