Backdoor (Trojan) attacks are a common threat to deep neural networks: samples from one or more source classes, once embedded with a backdoor trigger, are misclassified to adversarial target classes. Existing methods for detecting whether a classifier has been backdoor attacked are mostly designed for attacks with a single adversarial target (e.g., the all-to-one attack). To the best of our knowledge, no existing method can, without supervision, effectively address the more general X2X attack, which has an arbitrary number of source classes, each paired with an arbitrary target class. In this paper, we propose UMD, the first Unsupervised Model Detection method that effectively detects X2X backdoor attacks via a joint inference of the adversarial (source, target) class pairs. In particular, we first define a novel transferability statistic and use it, together with a proposed clustering approach, to measure and select a subset of putative backdoor class pairs. Then, the selected class pairs are jointly assessed for detection inference, based on an aggregation of their reverse-engineered trigger sizes, using a robust, unsupervised anomaly detector that we propose. We conduct comprehensive evaluations on the CIFAR-10, GTSRB, and Imagenette datasets, and show that our unsupervised UMD outperforms state-of-the-art detectors (even those with supervision) by 17%, 4%, and 8%, respectively, in detection accuracy against diverse X2X attacks. We also show that UMD retains strong detection performance against several strong adaptive attacks.
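To make the final detection step more concrete, the following is a minimal sketch of how an aggregated trigger size for a set of putative backdoor class pairs might be tested for anomaly. It is not the detector proposed in the paper: the abstract does not specify the aggregation or the anomaly detector, so the mean and a median absolute deviation (MAD) score are used here as stand-ins. The function name `detect_from_trigger_sizes`, the `threshold` value, and the layout of the `trigger_sizes` matrix are all hypothetical.

```python
import numpy as np


def detect_from_trigger_sizes(trigger_sizes, candidate_pairs, threshold=2.0):
    """Flag a model if the candidate class pairs have unusually small triggers.

    trigger_sizes: K x K array; entry [s, t] is the size (e.g. a norm) of the
        trigger reverse-engineered for source class s and target class t.
        Diagonal entries are ignored. (Assumed layout, for illustration.)
    candidate_pairs: iterable of (s, t) tuples selected as putative backdoor
        class pairs by some earlier step.
    threshold: assumed cut-off on the MAD-based score.
    """
    sizes = np.asarray(trigger_sizes, dtype=float)
    num_classes = sizes.shape[0]
    candidates = set(candidate_pairs)
    if not candidates:
        raise ValueError("at least one candidate class pair is required")

    # Aggregate the trigger sizes of the candidate pairs (mean is a stand-in).
    candidate_vals = np.array([sizes[s, t] for s, t in candidates])
    aggregated = candidate_vals.mean()

    # All remaining off-diagonal pairs serve as the reference population.
    null_vals = np.array([
        sizes[s, t]
        for s in range(num_classes)
        for t in range(num_classes)
        if s != t and (s, t) not in candidates
    ])
    if null_vals.size == 0:
        raise ValueError("no reference class pairs left to compare against")

    # Robust location and scale of the reference population.
    median = np.median(null_vals)
    mad = np.median(np.abs(null_vals - median))

    # Backdoor pairs are expected to need small triggers, so the score is
    # large when the aggregate lies far below the reference median.
    score = (median - aggregated) / (1.4826 * mad + 1e-12)
    return bool(score > threshold), float(score)
```

Calling `detect_from_trigger_sizes(sizes, [(0, 1), (2, 3)])` returns a `(flagged, score)` tuple. Scoring the candidate pairs jointly, rather than one target class at a time, is what lets a sketch like this cover attacks with several (source, target) pairs.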