We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), in which adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense falls within the category of post-development defenses, which operate independently of how the model was generated. It is built upon a novel reverse-engineering approach that directly extracts the backdoor functionality of a given backdoored model into a backdoor expert model. The approach is straightforward: finetune the backdoored model on a small set of intentionally mislabeled clean samples, so that it unlearns the normal functionality while preserving the backdoor functionality. The result is a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show the feasibility of devising highly accurate backdoor input detectors that filter out backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 17 state-of-the-art (SOTA) backdoor attacks while minimally impacting clean utility. The effectiveness of BaDExpert has been verified on multiple datasets (CIFAR10, GTSRB, and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer).
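To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It shows (a) one way to build the intentionally mislabeled clean set and (b) one possible rule for flagging inputs using the backdoor expert's output. The abstract does not specify either detail, so the label-shift rule, the agreement-based decision rule, the `0.5` threshold, and the helper names `mislabel` and `is_suspected_backdoor` are all assumptions made for illustration. The finetuning step itself is omitted.

```python
from typing import List, Sequence


def mislabel(labels: Sequence[int], num_classes: int, shift: int = 1) -> List[int]:
    """Return deliberately wrong labels by shifting each class index.

    Assumption: a cyclic shift is used here only because it guarantees that
    every label changes; the abstract does not say how samples are mislabeled.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to mislabel")
    if shift % num_classes == 0:
        raise ValueError("shift must change every label")
    return [(y + shift) % num_classes for y in labels]


def is_suspected_backdoor(
    original_pred: int,
    expert_probs: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Flag an input when the backdoor expert still supports the original prediction.

    Assumption: after unlearning normal functionality, the expert should give
    low probability to the original model's class on clean inputs, and high
    probability only when a backdoor trigger is driving the prediction.
    """
    return expert_probs[original_pred] >= threshold


# Step 1: build the mislabeled targets for a small clean set (10 classes assumed).
clean_labels = [0, 1, 2, 9]
wrong_labels = mislabel(clean_labels, num_classes=10)

# Step 2: hypothetical expert outputs for two inputs.
# Clean input: the original model predicts class 3, the expert no longer does.
clean_expert_probs = [0.02 if c == 3 else 0.98 / 9 for c in range(10)]
# Triggered input: the original model predicts target class 0, and so does the expert.
backdoor_expert_probs = [0.97 if c == 0 else 0.03 / 9 for c in range(10)]

flag_clean = is_suspected_backdoor(3, clean_expert_probs)
flag_backdoor = is_suspected_backdoor(0, backdoor_expert_probs)
```

In this sketch the mislabeled targets would be used to finetune the backdoored model, and the flagging rule would then be applied at inference time to filter suspicious inputs. The ensemble with a finetuned auxiliary model mentioned in the abstract is not modeled here.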