A backdoor attack allows a malicious user to manipulate the environment or corrupt the training data, thereby inserting a backdoor into the trained agent. Such attacks compromise the reliability of reinforcement learning (RL) systems and can lead to catastrophic results in safety-critical domains. Despite this risk, relatively little research has investigated effective defenses against backdoor attacks in RL. This paper proposes the Recovery Triggered States (RTS) method, a novel approach that protects victim agents from backdoor attacks. RTS builds a surrogate network that approximates the dynamics model. Developers can then use it to recover the environment from a triggered state to a clean state, which prevents attackers from activating a backdoor hidden in the agent by presenting the trigger. When training the surrogate to predict states, we incorporate agent action information to reduce the discrepancy between the actions the agent takes on predicted states and the actions it takes on real states. RTS is the first approach to defend against backdoor attacks in a single-agent setting. Our results show that, with RTS, the cumulative reward decreases by only 1.41% under the backdoor attack.
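The idea of a surrogate dynamics model trained with an action-consistency term can be illustrated with a minimal sketch. It is not the authors' implementation: the linear environment, the frozen linear policy `P`, the weight `lam`, the trigger pattern, and all function names are hypothetical stand-ins chosen to keep the example self-contained. The loss adds to the state-prediction error a penalty on the difference between the policy's outputs on predicted and real states.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

# Hypothetical stand-ins: a linear "true" environment and a frozen linear policy.
A_true = rng.normal(scale=0.3, size=(STATE_DIM, STATE_DIM))
B_true = rng.normal(scale=0.3, size=(STATE_DIM, ACTION_DIM))
P = rng.normal(size=(ACTION_DIM, STATE_DIM))  # policy logits = P @ state


def env_step(state, action_onehot):
    """Clean environment transition (assumed linear for illustration)."""
    return A_true @ state + B_true @ action_onehot


def policy_logits(state):
    return P @ state


def train_surrogate(n_samples=2000, epochs=2000, lr=0.02, lam=0.5):
    """Fit s' ~ W @ [s; a] by gradient descent on
    mean(||s_hat - s'||^2 + lam * ||P @ s_hat - P @ s'||^2)."""
    S = rng.normal(size=(n_samples, STATE_DIM))
    A = np.eye(ACTION_DIM)[rng.integers(ACTION_DIM, size=n_samples)]
    S_next = S @ A_true.T + A @ B_true.T          # clean transitions
    X = np.hstack([S, A])                         # surrogate inputs
    W = np.zeros((STATE_DIM, STATE_DIM + ACTION_DIM))
    # Both loss terms combine into err^T M err with a symmetric matrix M.
    M = np.eye(STATE_DIM) + lam * P.T @ P
    for _ in range(epochs):
        err = X @ W.T - S_next                    # state-prediction error
        grad = 2.0 * (err @ M).T @ X / n_samples  # gradient of the mean loss
        W -= lr * grad
    return W


def recover_state(W, prev_state, prev_action_onehot):
    """Predict the current clean state from the previous state and action."""
    return W @ np.concatenate([prev_state, prev_action_onehot])


W = train_surrogate()

# One step of a rollout: the attacker overwrites part of the observation.
prev_state = rng.normal(size=STATE_DIM)
prev_action = np.eye(ACTION_DIM)[int(np.argmax(policy_logits(prev_state)))]
true_state = env_step(prev_state, prev_action)
triggered_state = true_state.copy()
triggered_state[0] = 10.0                         # hypothetical trigger pattern

recovered = recover_state(W, prev_state, prev_action)
print("max recovery error:", np.abs(recovered - true_state).max())
```

In this toy setting the surrogate can fit the dynamics exactly, so the action-consistency term mainly changes how errors are weighted during training; it would matter more when the surrogate cannot represent the dynamics perfectly. How triggered states are identified at deployment is not covered by the sketch.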