Diffusion models (DMs) have become state-of-the-art generative models because they can generate high-quality images from noise without adversarial training. However, recent studies report that they are vulnerable to backdoor attacks: when an input (e.g., Gaussian noise) is stamped with a trigger (e.g., a white patch), the backdoored model always generates the target image (e.g., an improper photo). Effective defense strategies for mitigating backdoors in DMs remain underexplored. To bridge this gap, we propose the first backdoor detection and removal framework for DMs. We evaluate our framework, Elijah, on hundreds of DMs of three types (DDPM, NCSN, and LDM) with 13 samplers against three existing backdoor attacks. Extensive experiments show that our approach achieves close to 100% detection accuracy and reduces backdoor effects to close to zero without significantly sacrificing model utility.