Backdoor attacks inject poisoned data into the training set, causing the model to misclassify poisoned samples at inference time. Defending against such attacks is challenging, especially in real-world black-box settings where only model predictions are available. In this paper, we propose a novel backdoor defense framework that defends against various attacks through zero-shot image purification (ZIP). The framework can be applied to black-box models without requiring any internal information about the poisoned model or any prior knowledge of the clean or poisoned samples. It involves a two-step process. First, we apply a linear transformation to the poisoned image to destroy the trigger pattern. Then, we use a pre-trained diffusion model to recover the semantic information removed by the transformation. In particular, we design a new reverse process that uses the transformed image to guide the generation of high-fidelity purified images and that can be applied in zero-shot settings. We evaluate the ZIP backdoor defense framework on multiple datasets with different kinds of attacks. Experimental results demonstrate the superiority of ZIP over state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models.
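The first step of the two-step process can be illustrated with a minimal sketch. The abstract does not say which linear transformation is used, so the choice here (average pooling over k x k blocks followed by nearest-neighbour upsampling) is an assumption, as are the function name `avg_pool_blur`, the block size, and the synthetic 2 x 2 additive trigger patch. The sketch only shows that such a transformation is linear and attenuates a small localized trigger; it does not reproduce the framework itself.

```python
import numpy as np


def avg_pool_blur(image: np.ndarray, k: int = 4) -> np.ndarray:
    """Average-pool an (H, W, C) image over k x k blocks, then upsample
    back to (H, W, C) by repeating each pooled value (hypothetical choice
    of linear transformation, not taken from the abstract)."""
    h, w, c = image.shape
    if h % k or w % k:
        raise ValueError("height and width must be divisible by k")
    # Split each spatial axis into (blocks, k) and average inside each block.
    pooled = image.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))
    # Nearest-neighbour upsampling restores the original resolution.
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)


rng = np.random.default_rng(0)
clean = rng.random((32, 32, 3))      # stand-in for a clean image

trigger = np.zeros_like(clean)
trigger[-2:, -2:, :] = 1.0           # hypothetical 2 x 2 corner patch
poisoned = clean + trigger           # additive trigger, left unclipped

# Because the transformation is linear, what survives of the trigger is
# exactly the transformed trigger, independent of the image content.
residual = avg_pool_blur(poisoned) - avg_pool_blur(clean)

print("linear:", np.allclose(residual, avg_pool_blur(trigger)))
print("peak trigger amplitude before:", trigger.max())
print("peak trigger amplitude after: ", residual.max())
```

In this toy setting the patch covers 4 of the 16 pixels in its block, so its peak amplitude drops from 1.0 to 0.25, at the cost of also blurring the image content. The second step, in which a pre-trained diffusion model guided by the transformed image restores the lost semantic detail, is not sketched here because the abstract gives no details of the reverse process.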