Deep neural networks (DNNs) are vulnerable to backdoor attacks, where adversaries implant a hidden backdoor during model training and can later maliciously trigger model misclassifications. This paper proposes a simple yet effective input-level backdoor detection method (dubbed IBD-PSC) that acts as a `firewall' to filter out malicious testing images. Our method is motivated by an intriguing phenomenon, parameter-oriented scaling consistency (PSC): when model parameters are amplified, the prediction confidences of poisoned samples remain significantly more consistent than those of benign ones. In particular, we provide a theoretical analysis that establishes the foundations of the PSC phenomenon. We also design an adaptive method to select which batch normalization (BN) layers to scale up for effective detection. Extensive experiments on benchmark datasets verify the effectiveness and efficiency of our IBD-PSC method and its resistance to adaptive attacks. Code is available at \href{https://github.com/THUYimingLi/BackdoorBox}{BackdoorBox}.
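To make the PSC idea concrete, below is a minimal PyTorch sketch, assuming a classifier containing BatchNorm2d layers: it amplifies the affine parameters of selected BN layers by several factors and averages the confidence each amplified copy assigns to the model's original prediction. The helper names (scale_bn_layers, psc_score), the scaling factors, and the scoring rule are illustrative assumptions, not the paper's exact algorithm, which in addition selects the BN layers to scale adaptively.

\begin{verbatim}
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def scale_bn_layers(model, layer_indices, omega):
    """Return a copy of `model` whose selected BatchNorm2d layers
    have their affine weights and biases amplified by `omega`.
    (Illustrative helper; not the paper's exact procedure.)"""
    scaled = copy.deepcopy(model)
    bn_layers = [m for m in scaled.modules()
                 if isinstance(m, nn.BatchNorm2d)]
    with torch.no_grad():
        for idx in layer_indices:
            bn_layers[idx].weight.mul_(omega)
            bn_layers[idx].bias.mul_(omega)
    return scaled

def psc_score(model, x, layer_indices, omegas=(1.5, 2.0, 2.5)):
    """Average confidence assigned to the original prediction
    across several parameter-amplified model copies; per the PSC
    phenomenon, poisoned inputs tend to keep a high, consistent
    confidence, while benign inputs do not."""
    model.eval()
    with torch.no_grad():
        pred = model(x).argmax(dim=1)          # original prediction
        confs = []
        for omega in omegas:
            scaled = scale_bn_layers(model, layer_indices, omega)
            scaled.eval()
            probs = F.softmax(scaled(x), dim=1)
            confs.append(probs.gather(1, pred.unsqueeze(1)).squeeze(1))
    return torch.stack(confs).mean(dim=0)
\end{verbatim}

In this sketch, inputs whose score exceeds a calibrated threshold would be flagged as poisoned, mirroring the firewall-style filtering described above.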