Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism that aims to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches operate predominantly in the word space; they are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by drawing abnormal activations toward optimized minimum clean-activation distribution intervals. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level concepts such as syntax, and can therefore counteract diverse triggers; (2) the fine-grained, continuous nature of the activation space allows clean content to be preserved more precisely while triggers are removed. Furthermore, we propose a detection module based on statistics of abnormal activations to achieve a better trade-off between clean accuracy and defense performance.
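The purification and detection ideas summarized above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify how the intervals are optimized or which statistics the detector uses, so the sketch assumes (hypothetically) that each neuron's interval is the shortest window covering a fixed fraction of clean activations, that detection counts out-of-interval activations, and that purification clamps values to the interval bounds. The names `fit_intervals`, `abnormal_count`, `purify`, `coverage`, and `threshold` are all illustrative.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def fit_intervals(clean_acts: Sequence[Sequence[float]],
                  coverage: float = 0.9) -> List[Interval]:
    """For each neuron, find the shortest interval that contains a
    `coverage` fraction of the clean activations (an assumed stand-in
    for the paper's optimized minimum intervals)."""
    n = len(clean_acts)
    k = max(1, math.ceil(coverage * n))  # samples each interval must cover
    intervals = []
    for j in range(len(clean_acts[0])):
        col = sorted(row[j] for row in clean_acts)
        # Slide a window of k sorted values and keep the narrowest one.
        start = min(range(n - k + 1), key=lambda i: col[i + k - 1] - col[i])
        intervals.append((col[start], col[start + k - 1]))
    return intervals


def abnormal_count(act: Sequence[float], intervals: Sequence[Interval]) -> int:
    """Assumed detection statistic: number of activations outside their interval."""
    return sum(1 for a, (lo, hi) in zip(act, intervals) if a < lo or a > hi)


def purify(act: Sequence[float], intervals: Sequence[Interval],
           threshold: int = 0) -> List[float]:
    """Leave the sample untouched unless the detector fires; otherwise
    pull every activation back into its clean interval by clamping."""
    if abnormal_count(act, intervals) <= threshold:
        return list(act)
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(act, intervals)]


# Toy clean activations for two neurons; the last row is a clean outlier.
clean = [[0.0, 1.0], [0.1, 1.5], [0.2, 2.0], [0.1, 1.5], [5.0, 9.0]]
intervals = fit_intervals(clean, coverage=0.8)
suspicious = [3.0, 1.5]  # first neuron fires far outside its clean range
purified = purify(suspicious, intervals)
```

Gating the clamp on the detection statistic mirrors the trade-off the abstract mentions: samples that look clean pass through unchanged, so clean accuracy is not affected by purification, while flagged samples have their abnormal activations drawn back toward the clean range.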