As real-world applications of reinforcement learning (RL) become more common, the security and robustness of RL systems deserve more attention and exploration. In particular, recent works have shown that, in a multi-agent RL environment, a backdoor keyed to trigger actions can be injected into a victim agent (a.k.a. Trojan agent), causing catastrophic failure as soon as the agent observes the backdoor trigger action. To secure RL agents against malicious backdoors, we propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system, with the objective of detecting Trojan agents and their corresponding potential trigger actions, and then mitigating their Trojan behavior. To solve this problem, we propose PolicyCleanse, which is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments.
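The property that detection relies on (accumulated rewards of an activated Trojan agent degrade noticeably) can be illustrated with a toy check. The following is a minimal sketch and not the PolicyCleanse algorithm itself: the function names, the `drop_threshold` parameter, and the reward lists are all hypothetical, and the sketch assumes we already have per-timestep rewards from episodes played with and without a candidate trigger action sequence.

```python
from statistics import mean


def accumulated_reward(rewards):
    """Sum of per-timestep rewards for one episode."""
    return sum(rewards)


def looks_trojaned(baseline_episodes, probed_episodes, drop_threshold=0.5):
    """Flag an agent whose mean accumulated reward drops sharply.

    baseline_episodes: reward lists from episodes without a candidate trigger.
    probed_episodes: reward lists from episodes where the opponent plays
        the candidate trigger actions.
    drop_threshold: hypothetical relative drop above which we raise a flag.
    """
    base = mean(accumulated_reward(ep) for ep in baseline_episodes)
    probed = mean(accumulated_reward(ep) for ep in probed_episodes)
    # Normalize by the baseline magnitude, guarding against division by zero.
    scale = max(abs(base), 1e-8)
    return (base - probed) / scale > drop_threshold


# Made-up reward traces, for illustration only.
baseline = [[1, 1, 1, 1], [1, 1, 1, 0]]
triggered = [[1, -1, -1, -1], [1, 0, -1, -1]]
benign_probe = [[1, 1, 1, 0], [1, 1, 0, 1]]

flag_triggered = looks_trojaned(baseline, triggered)
flag_benign = looks_trojaned(baseline, benign_probe)
```

In this toy setup the relative drop is a single scalar per candidate; a real detector would also need a way to search for candidate trigger actions, which the sketch leaves out.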