Deep neural networks are vulnerable to backdoor attacks, in which an adversary maliciously manipulates model behavior by overlaying images with special triggers. Existing backdoor defense methods often require access to a small amount of validation data and to the model parameters, which is impractical in many real-world applications, e.g., when the model is provided as a cloud service. In this paper, we address the practical task of blind backdoor defense at test time, in particular for black-box models. The true label of every test image needs to be recovered on the fly from a suspicious model, regardless of whether the image is benign. We focus on test-time image purification methods that incapacitate possible triggers while keeping semantic content intact. Because trigger patterns and sizes are diverse, heuristic trigger search in image space can scale poorly. We circumvent this barrier by leveraging the strong reconstruction power of generative models, and propose a framework of Blind Defense with Masked AutoEncoder (BDMAE). It detects possible triggers in the token space using image structural similarity and label consistency between the test image and MAE restorations. The detection results are then refined by considering trigger topology. Finally, we adaptively fuse the MAE restorations into a purified image used for prediction. Our approach is blind to the model architecture, the trigger pattern, and image benignity. Extensive experiments under different backdoor settings validate its effectiveness and generalizability. Code is available at https://github.com/tsun/BDMAE.
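To make the structural-similarity step more concrete, the following is a minimal sketch and not the released BDMAE code. It assumes a grayscale image with values in [0, 1], side lengths divisible by the patch size, and a restoration that is already available; here the clean image stands in for an MAE restoration. The function names `patch_ssim`, `trigger_scores`, and `purify`, the fixed threshold, and the single-window SSIM are all illustrative assumptions. Label consistency, topology-based refinement, and adaptive fusion are omitted.

```python
import numpy as np

def patch_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two equally sized patches in [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def trigger_scores(image, restored, patch=4):
    """Per-token score 1 - SSIM; a high score means the restoration disagrees."""
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    scores = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = (slice(i * patch, (i + 1) * patch),
                      slice(j * patch, (j + 1) * patch))
            scores[i, j] = 1.0 - patch_ssim(image[window], restored[window])
    return scores

def purify(image, restored, scores, threshold=0.5, patch=4):
    """Replace tokens whose score exceeds the (assumed) threshold."""
    token_mask = scores > threshold
    # Upsample the token-level mask to pixel resolution.
    pixel_mask = np.repeat(np.repeat(token_mask, patch, axis=0), patch, axis=1)
    return np.where(pixel_mask, restored, image)

# Toy example: a smooth gradient image with a checkerboard "trigger"
# pasted into the bottom-right 4x4 token.
clean = np.tile(np.linspace(0.2, 0.8, 16), (16, 1))
attacked = clean.copy()
attacked[12:16, 12:16] = np.indices((4, 4)).sum(axis=0) % 2

# Assumption: the clean image plays the role of the MAE restoration.
scores = trigger_scores(attacked, clean)
purified = purify(attacked, clean, scores)
```

In this toy setting only the token covering the pasted pattern receives a high score, so only that token is replaced. With a real MAE the restoration is imperfect everywhere, which is one reason the abstract combines structural similarity with label consistency and topology-based refinement instead of relying on a single threshold.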