The widespread adoption of deep learning across industries has introduced substantial challenges, particularly in model explainability and security. The complexity that makes deep learning models effective also renders them susceptible to adversarial attacks. Among these, backdoor attacks are especially concerning: an adversary surreptitiously embeds specific triggers in the training data, causing the model to behave aberrantly whenever an input contains the trigger. Such attacks often exploit vulnerabilities in outsourced training or data-collection processes, compromising model integrity without degrading performance on clean (trigger-free) inputs. In this paper, we present a comprehensive review of existing mitigation strategies against backdoor attacks in image recognition, analyzing the theoretical foundations, practical efficacy, and limitations of each approach. We also conduct an extensive benchmark of sixteen state-of-the-art defenses against eight distinct backdoor attacks, using three datasets, four model architectures, and three poisoning ratios. Our results, derived from 122,236 individual experiments, indicate that while many defenses provide some level of protection, their performance varies considerably. Moreover, compared to two seminal approaches, most newer defenses do not deliver substantial improvements in overall performance or consistency across diverse settings. Drawing on these findings, we propose directions for developing more effective and generalizable defensive mechanisms.
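To make the threat model concrete, the sketch below illustrates the classic data-poisoning step behind a BadNets-style backdoor attack: a small pixel patch is stamped onto a fraction of the training images, and those images are relabeled with an attacker-chosen target class. This is a minimal illustrative sketch, not the implementation of any specific attack benchmarked in the paper; the function name, patch size, and default poisoning ratio are assumptions for exposition.

```python
import numpy as np

def poison_dataset(images, labels, target_class=0, poison_ratio=0.05,
                   patch_size=3, seed=42):
    """Illustrative BadNets-style poisoning sketch.

    Stamps a solid white square into the bottom-right corner of a random
    subset of training images and relabels them with the attacker's
    target class.

    images: uint8 array of shape (N, H, W, C); labels: int array of shape (N,).
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    n_poison = int(len(images) * poison_ratio)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # The trigger: a small fixed patch in one corner. A model trained on
    # this data learns to map any input carrying the patch to
    # `target_class`, while clean (trigger-free) inputs are classified
    # normally -- which is what makes the attack hard to detect.
    images[idx, -patch_size:, -patch_size:, :] = 255
    labels[idx] = target_class
    return images, labels, idx
```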