MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic

Backdoor attacks are an important type of adversarial threat against deep neural network classifiers, wherein test samples from one or more source classes will be (mis)classified to the attacker's target class when a backdoor pattern is embedded. In this paper, we focus on the post-training backdoor defense scenario commonly considered in the literature, where the defender aims to detect whether a trained classifier was backdoor-attacked without any access to the training set. Many post-training detectors are designed to detect attacks that use one or a few specific backdoor embedding functions (e.g., patch-replacement or additive attacks). These detectors may fail when the backdoor embedding function used by the attacker (unknown to the defender) differs from the one assumed by the defender. In contrast, we propose a post-training defense that detects backdoor attacks with arbitrary types of backdoor embeddings, without making any assumptions about the backdoor embedding type. Our detector leverages the influence of the backdoor attack, independent of the backdoor embedding mechanism, on the landscape of the classifier's outputs prior to the softmax layer. For each class, a maximum margin statistic is estimated. Detection inference is then performed by applying an unsupervised anomaly detector to these statistics. Thus, our detector does not need any legitimate clean samples, and can efficiently detect backdoor attacks with arbitrary numbers of source classes. These advantages over several state-of-the-art methods are demonstrated on four datasets, for three different types of backdoor patterns, and for a variety of attack configurations. Finally, we propose a novel, general approach for backdoor mitigation once a detection is made. The mitigation approach was the runner-up at the first IEEE Trojan Removal Competition. The code is available online.
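
To make the two-stage idea in the abstract concrete (a per-class maximum margin statistic followed by unsupervised anomaly detection), here is a minimal sketch on a toy problem. It is not the authors' implementation. The linear "classifier", the input box [0, 1]^d, the projected gradient ascent settings, and all function names (`build_toy_classifier`, `max_margin_statistic`, `detect`) are assumptions made for illustration. The anomaly detector is a simple median-absolute-deviation (MAD) score standing in for whatever detector the paper actually uses. The "backdoor" is mimicked by inflating the weights of one class on a few extra input dimensions.

```python
import numpy as np


def build_toy_classifier(n_classes=10, block=3, n_backdoor_dims=4, target=3, seed=0):
    """Hypothetical linear classifier: logits = W @ x + b.

    Each class responds to its own block of input dimensions. The target
    class additionally gets large weights on a few extra dimensions, a crude
    stand-in for the effect of a backdoor on the logit landscape.
    """
    rng = np.random.default_rng(seed)
    d = n_classes * block + n_backdoor_dims
    W = rng.normal(0.0, 0.05, size=(n_classes, d))
    for c in range(n_classes):
        W[c, c * block:(c + 1) * block] += 1.0
    W[target, n_classes * block:] += 3.0
    b = np.zeros(n_classes)
    return W, b


def max_margin_statistic(W, b, c, n_restarts=3, steps=300, lr=0.05, seed=0):
    """Estimate max over x in [0, 1]^d of logit_c(x) - max_{k != c} logit_k(x).

    Uses projected (sub)gradient ascent from random starting points; no
    clean samples are involved.
    """
    rng = np.random.default_rng(seed)
    best = -np.inf
    for _ in range(n_restarts):
        x = rng.uniform(0.0, 1.0, size=W.shape[1])
        for _ in range(steps):
            logits = W @ x + b
            rivals = logits.copy()
            rivals[c] = -np.inf
            k = int(np.argmax(rivals))  # strongest competing class
            # Record the margin at the current point, keeping the best seen.
            best = max(best, float(logits[c] - rivals[k]))
            # Subgradient of the margin w.r.t. x, then project onto the box.
            x = np.clip(x + lr * (W[c] - W[k]), 0.0, 1.0)
    return best


def detect(stats, threshold=3.5):
    """Flag classes whose statistic is abnormally large (MAD-based score)."""
    stats = np.asarray(stats, dtype=float)
    med = np.median(stats)
    mad = np.median(np.abs(stats - med))
    scores = (stats - med) / (1.4826 * mad + 1e-12)
    return scores, np.flatnonzero(scores > threshold)


W, b = build_toy_classifier()
stats = [max_margin_statistic(W, b, c) for c in range(W.shape[0])]
scores, flagged = detect(stats)
print("max margin statistics:", np.round(stats, 2))
print("flagged classes:", flagged.tolist())
```

In this toy setup the class with inflated weights attains a much larger maximum margin than the others, so it stands out under the one-sided outlier test. For a real deep network, the analytic subgradient would be replaced by automatic differentiation over the network's pre-softmax outputs.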
