A variety of defenses have been proposed against backdoor attacks on deep neural network (DNN) classifiers. Universal methods seek to reliably detect and/or mitigate backdoors irrespective of the incorporation mechanism used by the attacker, while reverse-engineering methods often explicitly assume one. In this paper, we describe a new detector that: relies on internal feature maps of the defended DNN to detect and reverse-engineer the backdoor and identify its target class; can operate post-training (without access to the training dataset); is highly effective for various incorporation mechanisms (i.e., is universal); and has low computational overhead and so is scalable. Our detection approach is evaluated against different attacks on a benchmark CIFAR-10 image classifier.