A backdoor attack in deep learning implants a hidden backdoor in a model so that specific input patterns trigger malicious behavior. Existing detection approaches assume a metric space (for either the original inputs or their latent representations) in which normal samples and malicious samples are separable. We show that this assumption has a severe limitation by introducing a novel SSDT (Source-Specific and Dynamic-Triggers) backdoor, which obscures the difference between normal samples and malicious samples. To overcome this limitation, we move beyond looking for a perfect metric space that would work for different deep-learning models, and instead resort to more robust topological constructs. We propose TED (Topological Evolution Dynamics) as a model-agnostic basis for robust backdoor detection. The main idea of TED is to view a deep-learning model as a dynamical system that evolves inputs to outputs. In such a dynamical system, a benign input follows a natural evolution trajectory similar to that of other benign inputs. In contrast, a malicious sample displays a distinct trajectory, since it starts close to benign samples but eventually shifts towards the neighborhood of attacker-specified target samples to activate the backdoor. We conduct extensive evaluations on vision and natural language datasets across different network architectures. The results demonstrate that TED not only achieves a high detection rate, but also significantly outperforms existing state-of-the-art detection approaches, particularly against the sophisticated SSDT attack. The code to reproduce the results is publicly available on GitHub.
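The trajectory idea described above can be illustrated with a small sketch. This is not the released implementation; it is a minimal example under stated assumptions: per-layer activations are assumed to be already extracted, distances are Euclidean, and the trajectory is summarized as the per-layer rank of the nearest benign reference sample that shares the input's predicted label. The function name `rank_trajectory` and all toy data are hypothetical.

```python
import numpy as np


def rank_trajectory(layer_acts, ref_layer_acts, ref_labels, predicted_label):
    """Per-layer rank of the nearest reference sample with the predicted label.

    layer_acts: list of 1-D arrays, the input's activation at each layer.
    ref_layer_acts: list of 2-D arrays (n_ref, dim), benign reference
        activations at each layer (assumed precomputed).
    ref_labels: 1-D array of length n_ref with the reference labels.
    predicted_label: the model's final prediction for the input.
    """
    ref_labels = np.asarray(ref_labels)
    same_class = ref_labels == predicted_label
    if not same_class.any():
        raise ValueError("no reference sample carries the predicted label")

    ranks = []
    for act, ref in zip(layer_acts, ref_layer_acts):
        # Euclidean distance from the input to every reference sample.
        dists = np.linalg.norm(np.asarray(ref) - np.asarray(act), axis=1)
        order = np.argsort(dists, kind="stable")
        # Index of the first same-class reference in nearest-first order.
        ranks.append(int(np.argmax(same_class[order])))
    return ranks


# Toy data (hypothetical): two layers, two well-separated classes.
ref_labels = np.array([0, 0, 0, 1, 1, 1])
ref = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
ref_layer_acts = [ref, ref]

# A benign class-1 input stays near class 1 at every layer.
benign = [np.array([10.2, 10.1]), np.array([10.1, 10.3])]
# A suspect input starts near class 0 but ends near (and is predicted as) class 1.
suspect = [np.array([0.3, 0.2]), np.array([10.1, 10.3])]

benign_traj = rank_trajectory(benign, ref_layer_acts, ref_labels, 1)
suspect_traj = rank_trajectory(suspect, ref_layer_acts, ref_labels, 1)
print("benign:", benign_traj)
print("suspect:", suspect_traj)
```

In this toy setting the benign input keeps a low rank at every layer, while the suspect input has a high rank early on that drops only at the last layer. A detector could then flag trajectories that deviate from those of known benign inputs; the choice of outlier detector is left open here.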