Backdoor attacks are a common threat to deep neural networks. At test time, a backdoored model misclassifies samples embedded with a backdoor trigger to an adversarial target class, while still classifying samples without the trigger correctly. In this paper, we present the first certified backdoor detector (CBD), built on a novel, adjustable conformal prediction scheme that uses our proposed statistic, local dominant probability. For any classifier under inspection, CBD provides 1) a detection inference, 2) the condition under which attacks are guaranteed to be detectable for the same classification domain, and 3) a probabilistic upper bound on the false positive rate. Our theoretical results show that attacks whose triggers are more resilient to test-time noise and have smaller perturbation magnitudes are more likely to be detected with guarantees. Moreover, we conduct extensive experiments on four benchmark datasets, considering various backdoor types such as BadNet, CB, and Blend. CBD achieves detection accuracy comparable to or higher than that of state-of-the-art detectors, and additionally provides detection certification. Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0.75$ that achieve more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.
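To illustrate how a conformal scheme can yield a bound on the false positive rate, the following is a minimal sketch of a generic conformal p-value test. It is not the CBD procedure itself: the abstract does not define local dominant probability or the adjustable scheme, so the function names, the scalar detection statistic, the convention that larger values are more suspicious, and the threshold `alpha` are all assumptions made for illustration. The sketch assumes a set of calibration statistics computed on classifiers known to be benign.

```python
from typing import Sequence


def conformal_p_value(calibration_stats: Sequence[float], test_stat: float) -> float:
    """Conformal p-value of a test statistic against benign calibration statistics.

    Hypothetical setup: each calibration statistic comes from a classifier
    known to be benign, and a larger statistic is assumed to be more suspicious.
    """
    n = len(calibration_stats)
    # Count calibration statistics at least as extreme as the test statistic.
    at_least = sum(1 for s in calibration_stats if s >= test_stat)
    # The +1 terms treat the test point as one more exchangeable sample.
    return (at_least + 1) / (n + 1)


def flag_as_backdoored(
    calibration_stats: Sequence[float], test_stat: float, alpha: float = 0.05
) -> bool:
    """Flag the inspected classifier when its conformal p-value is at most alpha."""
    return conformal_p_value(calibration_stats, test_stat) <= alpha


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    benign_stats = [0.1, 0.2, 0.3, 0.4]
    print(conformal_p_value(benign_stats, 0.25))
    print(flag_as_backdoored(benign_stats, 0.5, alpha=0.2))
```

If the statistic of a benign classifier under inspection is exchangeable with the calibration statistics, the standard conformal argument gives a probability of at most `alpha` that it is flagged, which is one way a probabilistic bound on the false positive rate can arise. The exact form of the bound in the paper may differ.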