We present a novel defense against backdoor attacks on Deep Neural Networks (DNNs), in which adversaries covertly implant malicious behaviors (backdoors) into DNNs. Our defense is a post-development defense: it operates independently of how the model was generated. It builds on a novel reverse-engineering approach that directly extracts the backdoor functionality of a given backdoored model into a backdoor expert model. The approach is straightforward: finetune the backdoored model on a small set of intentionally mislabeled clean samples, so that it unlearns the normal functionality while preserving the backdoor functionality. The result is a model (dubbed a backdoor expert model) that can only recognize backdoor inputs. Based on the extracted backdoor expert model, we show that it is feasible to devise highly accurate backdoor input detectors that filter out backdoor inputs during model inference. Further augmented by an ensemble strategy with a finetuned auxiliary model, our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 SOTA backdoor attacks while minimally impacting clean utility. We verify the effectiveness of BaDExpert on multiple datasets (CIFAR10, GTSRB, and ImageNet) across various model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer).
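The two data-level steps described above, mislabeling clean samples and filtering inputs with the backdoor expert, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the mislabeling rule used here (shift each label to the next class) and the detection rule (flag an input when the backdoor expert agrees with the original model) are illustrative choices, and the names `mislabel` and `flag_backdoor_inputs` are hypothetical. The finetuning itself and the ensemble with the auxiliary model are omitted.

```python
from typing import List


def mislabel(labels: List[int], num_classes: int) -> List[int]:
    """Assign every clean sample a deliberately wrong label.

    Assumption: shifting to the next class (mod num_classes) is one simple
    way to guarantee that no label stays correct.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes to mislabel")
    return [(y + 1) % num_classes for y in labels]


def flag_backdoor_inputs(model_preds: List[int], expert_preds: List[int]) -> List[bool]:
    """Flag inputs on which the backdoor expert agrees with the original model.

    Assumption: after finetuning on mislabeled clean data, the expert no longer
    predicts clean inputs correctly, so agreement hints at a backdoor input.
    """
    if len(model_preds) != len(expert_preds):
        raise ValueError("prediction lists must have the same length")
    return [m == e for m, e in zip(model_preds, expert_preds)]


# Labels of a small clean set for a 10-class task (toy values).
clean_labels = [0, 1, 9]
wrong_labels = mislabel(clean_labels, num_classes=10)

# Toy predictions: the expert agrees with the model only on the second input.
model_preds = [3, 7, 5]
expert_preds = [4, 7, 6]
flags = flag_backdoor_inputs(model_preds, expert_preds)

print(wrong_labels)  # [1, 2, 0]
print(flags)         # [False, True, False]
```

In practice a detector would more likely compare confidence scores against a threshold than hard predictions; the hard-agreement rule is used here only to keep the sketch short.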