Deep neural networks are vulnerable to backdoor attacks (Trojans), in which an attacker poisons the training set with samples containing a backdoor trigger so that the network learns to classify test-time inputs carrying the trigger to the attacker's designated target class. Recent work shows that backdoor poisoning induces over-fitting (abnormally large activations) in the attacked model. This motivates a general, post-training clipping method for backdoor mitigation, in which bounds on internal-layer activations are learned from a small set of clean samples. We devise a new approach of this kind that chooses the activation bounds to explicitly limit classification margins. The method outperforms peer methods on CIFAR-10 image classification. We also show that it is robust against adaptive attacks and X2X attacks, and that it remains effective on other datasets. Finally, we demonstrate an extension of the method for test-time detection and correction, based on the output differences between the original and activation-bounded networks. The code for our method is available online.
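To make the clipping and detection ideas concrete, the following is a minimal sketch on a toy two-layer ReLU network, not the paper's implementation. Everything in it is an illustrative assumption: the weights, the hand-set upper bounds in `BOUNDS` (the paper learns its bounds from clean samples so as to limit classification margins, which this sketch does not attempt), and the detection rule, which simply flags an input when the original and activation-bounded networks predict different classes.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def forward(x: Vector, w_hidden: Matrix, w_out: Matrix,
            upper_bounds: Optional[Vector] = None) -> Vector:
    """Return logits; if `upper_bounds` is given, clip hidden activations."""
    hidden = [max(0.0, h) for h in linear(w_hidden, x)]  # ReLU
    if upper_bounds is not None:
        # Activation clipping: cap each hidden unit at its upper bound.
        hidden = [min(h, b) for h, b in zip(hidden, upper_bounds)]
    return linear(w_out, hidden)


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def detect_and_correct(x: Vector, w_hidden: Matrix, w_out: Matrix,
                       upper_bounds: Vector) -> Tuple[bool, int]:
    """Flag x if the two networks disagree (assumed rule); return the
    bounded network's prediction as the corrected label."""
    original = argmax(forward(x, w_hidden, w_out))
    bounded = argmax(forward(x, w_hidden, w_out, upper_bounds))
    return original != bounded, bounded


# Hypothetical toy model: hidden unit 1 plays the role of a backdoor neuron
# that a trigger drives to an abnormally large activation.
W_HIDDEN: Matrix = [[1.0, 0.0], [0.0, 1.0]]
W_OUT: Matrix = [[1.0, 0.0], [0.0, 0.5]]
BOUNDS: Vector = [2.0, 1.0]  # hand-set here; learned from clean data in the paper

CLEAN_INPUT: Vector = [1.0, 0.5]
TRIGGERED_INPUT: Vector = [1.0, 10.0]

if __name__ == "__main__":
    print(detect_and_correct(CLEAN_INPUT, W_HIDDEN, W_OUT, BOUNDS))      # (False, 0)
    print(detect_and_correct(TRIGGERED_INPUT, W_HIDDEN, W_OUT, BOUNDS))  # (True, 0)
```

In this toy setting the clean input stays below both bounds, so clipping leaves its prediction unchanged, while the triggered input's large activation is capped and the prediction moves from the target class 1 back to class 0. A real model would apply such bounds across several internal layers.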