Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but to misclassify poisoned data into the poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC), i.e., the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons because the TAC values are estimated inaccurately. In this work, we propose a novel backdoor removal method that accurately reconstructs TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting perturbations with statistically small $L^2$ norms and leverage the perturbation of the poisoned class during fine-tuning to remove the backdoor. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrate that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.
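As a rough sketch of the per-class minimal-perturbation problem described above, suppose the classifier applies a linear head with weights $W=[w_1,\dots,w_K]^\top$ and biases $b$ to a latent feature $h(x)$; the constraint form and notation here are illustrative assumptions, not the paper's exact formulation:
\[
\delta_t^*(x) \;=\; \operatorname*{arg\,min}_{\delta}\;\tfrac{1}{2}\lVert \delta \rVert_2^2
\quad \text{s.t.} \quad (w_t - w_c)^\top \bigl(h(x) + \delta\bigr) + (b_t - b_c) \;\ge\; 0 \quad \forall\, c \neq t.
\]
The objective is quadratic and the constraints are linear in $\delta$, so the problem is convex; under this reading, the poisoned class would be the target class $t$ whose clean-data perturbations exhibit statistically small $\lVert \delta_t^*(x)\rVert_2$.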