While real-world applications of reinforcement learning (RL) are becoming popular, the security and robustness of RL systems deserve more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. Trojan agent), causing a catastrophic failure as soon as the agent observes the trigger action. To secure RL agents against malicious backdoors, in this work we propose the problem of Backdoor Detection in a multi-agent competitive reinforcement learning system, with the objective of detecting Trojan agents and their corresponding potential trigger actions, and then mitigating their Trojan behavior. To solve this problem, we propose PolicyCleanse, which is based on the property that an activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments.
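The property that detection relies on, namely that an activated Trojan agent's accumulated rewards degrade noticeably, can be illustrated with a minimal sketch. This is not the PolicyCleanse algorithm itself, whose details are not given here. The function names, the idea of comparing against a baseline episode, and the fixed `threshold` are all assumptions made for illustration only.

```python
from typing import Dict, List, Sequence


def accumulated_reward(rewards: Sequence[float]) -> float:
    """Sum of per-timestep rewards collected by the agent under test."""
    return float(sum(rewards))


def reward_drop(baseline: Sequence[float], probed: Sequence[float]) -> float:
    """How much lower the accumulated reward is when the agent faces a
    candidate action sequence, compared with a baseline episode."""
    return accumulated_reward(baseline) - accumulated_reward(probed)


def suspicious_candidates(
    baseline: Sequence[float],
    probed_by_candidate: Dict[str, Sequence[float]],
    threshold: float,
) -> List[str]:
    """Return the (hypothetical) candidate trigger names whose reward drop
    exceeds the assumed threshold, in sorted order."""
    return sorted(
        name
        for name, rewards in probed_by_candidate.items()
        if reward_drop(baseline, rewards) > threshold
    )


# Toy per-timestep rewards; the numbers are made up for illustration.
baseline_rewards = [1.0, 1.0, 1.0, 1.0]
probed_rewards = {
    "candidate_a": [1.0, 1.0, 1.0, 1.0],     # no degradation
    "candidate_b": [1.0, -1.0, -1.0, -1.0],  # reward collapses after step 1
}
flagged = suspicious_candidates(baseline_rewards, probed_rewards, threshold=2.0)
```

In this toy setup only `candidate_b` is flagged, because the agent's accumulated reward falls from 4.0 to -2.0 once that sequence is played against it. An actual detector would also need a way to search for candidate trigger actions and a principled choice of threshold.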