Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free, inference-time setting, in which model parameters remain frozen. First, we audit a common projection assumption using oracle access to paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at attack success rates (ASR) of 0.683, 0.888, and 0.941 on CIFAR-10 with ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. Instead, it is predicted by a logit-triplet gap involving the target margin, the target-logit drop, and the non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only the flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation data, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers, without requiring triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, the ModelNet10 point-blend trigger, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.
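To make the gated-reset idea concrete, the following is a minimal sketch of a Mahalanobis-gated feature reset, not the InstantForget implementation itself. The function names (`fit_clean_stats`, `mahalanobis_score`, `gated_reset`), the ridge term `eps`, the interpolation weight `alpha`, and the use of the clean feature mean as the neutral representation are all illustrative assumptions; the abstract does not specify how the neutral non-target representation or the threshold is constructed.

```python
import numpy as np


def fit_clean_stats(clean_feats, eps=1e-3):
    """Estimate the mean and a regularized inverse covariance from clean features."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False)
    # Assumption: a small ridge term keeps the covariance invertible.
    cov = cov + eps * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)


def mahalanobis_score(feats, mu, cov_inv):
    """Squared Mahalanobis distance of each feature row to the clean mean."""
    diff = feats - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


def gated_reset(feats, mu, cov_inv, neutral, threshold, alpha=1.0):
    """Move only flagged rows toward `neutral`; unflagged rows are returned unchanged.

    `neutral` and `alpha` are hypothetical parameters for this sketch:
    alpha=1.0 replaces a flagged feature with `neutral` outright.
    """
    scores = mahalanobis_score(feats, mu, cov_inv)
    flagged = scores > threshold
    out = feats.copy()
    out[flagged] = (1.0 - alpha) * feats[flagged] + alpha * neutral
    return out, flagged
```

In this sketch the model parameters are never touched: the gate operates on penultimate-layer features, and the frozen classifier head would be applied to the returned array. The threshold plays the role of the fixed operating point; how it is chosen (here left to the caller) is the part the abstract ties to held-out triggered validation.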