Diffusion models are state-of-the-art deep generative models trained by learning forward and reverse diffusion processes through progressive noise addition and denoising. To better understand their limitations and potential risks, this paper presents the first study of the robustness of diffusion models against backdoor attacks. Specifically, we propose BadDiffusion, a novel attack framework that engineers compromised diffusion processes during model training to implant a backdoor. At inference, the backdoored diffusion model behaves like an untampered generator on regular inputs, but generates a target outcome chosen by the attacker when it receives the implanted trigger signal. This risk can have severe consequences for downstream tasks and applications built on the compromised model. Our extensive experiments across various backdoor attack settings show that BadDiffusion consistently yields compromised diffusion models with high utility and target specificity. Even worse, BadDiffusion can be made cost-effective: finetuning a clean pre-trained diffusion model is enough to implant a backdoor. We also explore possible countermeasures for risk mitigation. Our results call attention to the potential risks and possible misuse of diffusion models. Our code is available at https://github.com/IBM/BadDiffusion.
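For readers unfamiliar with the "progressive noise addition" mentioned above, the following is a minimal sketch of the standard forward diffusion step, assuming a DDPM-style linear noise schedule. The function names, the schedule endpoints, and the toy four-element input are illustrative assumptions and are not taken from the paper or its repository. How BadDiffusion modifies this process to implant a backdoor is not specified in the abstract, so it is not shown here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return a linearly spaced noise schedule beta_1..beta_T (assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    Returns the noisy sample and the Gaussian noise that was added, since a
    denoiser is typically trained to predict that noise.
    """
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [1.0, -1.0, 0.5, 0.0]  # toy stand-in for a flattened image
x_early, _ = forward_noise(x0, 0, alpha_bars, rng)    # almost identical to x0
x_late, _ = forward_noise(x0, 999, alpha_bars, rng)   # almost pure noise
```

At early steps the sample stays close to the clean input, while at the final step almost none of the original signal remains. The reverse (denoising) process learns to undo this corruption step by step.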