Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is corrected. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities in the act of exploiting the backdoor.
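To make the idea of "correcting the distribution alteration" concrete, the following is a minimal sketch rather than the method proposed in this paper. It assumes that the alteration at a single internal neuron can be undone by an affine map that matches the mean and standard deviation of trigger-instance activations to those of clean-instance activations. The function names (`fit_moments`, `make_correction`), the synthetic activations, and the shift-and-scale model of the trigger's effect are all hypothetical and chosen only for illustration. In practice the trigger-instance activations would be obtained by embedding a reverse-engineered trigger into clean inputs.

```python
import random
import statistics


def fit_moments(values):
    """Return (mean, population standard deviation) of a list of activations."""
    return statistics.fmean(values), statistics.pstdev(values)


def make_correction(clean_acts, trigger_acts):
    """Build an affine map that sends trigger-instance activations onto the
    clean-instance distribution by matching first and second moments.

    Assumption: moment matching stands in for whatever divergence-based
    correction the actual method uses.
    """
    mu_c, sd_c = fit_moments(clean_acts)
    mu_t, sd_t = fit_moments(trigger_acts)
    scale = sd_c / sd_t if sd_t > 0 else 1.0

    def correct(activation):
        return (activation - mu_t) * scale + mu_c

    return correct


random.seed(0)

# Hypothetical activations of one internal neuron on clean inputs.
clean = [random.gauss(0.0, 1.0) for _ in range(5000)]

# Hypothetical effect of the trigger on that neuron: a shift and a rescaling.
triggered = [3.0 + 2.0 * a for a in clean]

correct = make_correction(clean, triggered)
corrected = [correct(a) for a in triggered]

mu_clean, sd_clean = fit_moments(clean)
mu_trig, sd_trig = fit_moments(triggered)
mu_fixed, sd_fixed = fit_moments(corrected)

print("clean     mean/std:", mu_clean, sd_clean)
print("triggered mean/std:", mu_trig, sd_trig)
print("corrected mean/std:", mu_fixed, sd_fixed)
```

Because the correction acts only on activations, the network's trainable parameters stay untouched, which mirrors the property claimed above. The same two sets of moments could also serve as a crude detector, flagging a test instance whose activation lies closer to the trigger-instance statistics than to the clean ones, though that too is an illustrative assumption.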