Backdoor attacks pose significant challenges to the security of machine learning models, particularly for overparameterized models like deep neural networks. In this paper, we propose ProP (Propagation Perturbation), a novel and scalable backdoor detection method that leverages statistical output distributions to identify backdoored models and their target classes without relying on exhaustive optimization strategies. ProP introduces a new metric, the benign score, to quantify output distributions and effectively distinguish between benign and backdoored models. Unlike existing approaches, ProP operates under minimal assumptions, requiring no prior knowledge of triggers or malicious samples, which makes it highly applicable to real-world scenarios. Extensive experimental validation across multiple popular backdoor attacks demonstrates that ProP achieves high detection accuracy and computational efficiency, outperforming existing methods. These results highlight ProP's potential as a robust and practical solution for backdoor detection.