Deep neural networks (DNNs) are vulnerable to backdoor attacks, where adversaries can maliciously trigger model misclassifications by implanting a hidden backdoor during model training. This paper proposes a simple yet effective input-level backdoor detection method (dubbed IBD-PSC) that acts as a 'firewall' to filter out malicious test images. Our method is motivated by an intriguing phenomenon, parameter-oriented scaling consistency (PSC): when model parameters are amplified, the prediction confidences of poisoned samples remain significantly more consistent than those of benign ones. We provide a theoretical analysis that grounds the PSC phenomenon, and we design an adaptive method to select which batch normalization (BN) layers to scale up for effective detection. Extensive experiments on benchmark datasets verify the effectiveness and efficiency of IBD-PSC and its resistance to adaptive attacks.
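To illustrate the core PSC mechanism, below is a minimal sketch in PyTorch: it amplifies the affine parameters of the last few BN layers and measures how confidently the scaled model retains the original prediction, flagging inputs whose confidence stays high. The set of scaled layers, the scaling factors, and the decision threshold are illustrative assumptions, not the paper's exact adaptive selection procedure.

```python
# Minimal sketch of parameter-oriented scaling consistency (PSC) detection.
# Assumes a PyTorch image classifier; factors, layer count, and threshold
# below are illustrative assumptions, not the paper's adaptive procedure.
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def psc_score(model: nn.Module, x: torch.Tensor,
              factors=(1.5, 2.0, 2.5), num_last_bn: int = 3) -> torch.Tensor:
    """Average confidence on the originally predicted label after
    amplifying the affine parameters of the last few BN layers."""
    model.eval()
    base_pred = model(x).argmax(dim=1)  # predictions before scaling

    # Names of the last few BatchNorm layers (heuristic stand-in for
    # the paper's adaptive layer-selection method).
    bn_names = [n for n, m in model.named_modules()
                if isinstance(m, nn.BatchNorm2d)][-num_last_bn:]

    confidences = []
    for k in factors:
        scaled = copy.deepcopy(model)
        for name, module in scaled.named_modules():
            if name in bn_names:
                module.weight.mul_(k)  # amplify gamma
                module.bias.mul_(k)    # amplify beta
        probs = torch.softmax(scaled(x), dim=1)
        # Confidence the scaled model assigns to the original prediction.
        confidences.append(probs.gather(1, base_pred.unsqueeze(1)).squeeze(1))

    # High average confidence under scaling -> prediction is consistent,
    # which PSC associates with poisoned inputs.
    return torch.stack(confidences).mean(dim=0)

# Usage (threshold 0.9 is an assumed value for illustration):
# is_poisoned = psc_score(model, batch) > 0.9
```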