In recent years, the security of artificial intelligence has become an increasingly prominent concern, driven by the rapid development of deep learning research and applications. A backdoor attack targets a vulnerability of deep learning models: the attacker embeds a hidden backdoor that is activated by a trigger, causing the model to output malicious predictions that may not align with the intended output for a given input. In this work, we propose a novel black-box backdoor attack based on machine unlearning. The attacker first augments the training set with carefully designed samples, including poison and mitigation data, to train a "benign" model. The attacker then submits unlearning requests for the mitigation samples to remove their influence on the model, gradually activating the hidden backdoor. Because the backdoor is implanted during the iterative unlearning process, the attack significantly increases the computational overhead of existing defense methods for backdoor detection or mitigation. To address this new security threat, we propose two methods for detecting or mitigating such malicious unlearning requests. We conduct experiments in both exact unlearning and approximate unlearning (i.e., SISA) settings. Experimental results indicate that: 1) our attack can successfully implant a backdoor into the model, and sharding increases the difficulty of the attack; 2) our detection algorithms are effective in identifying the mitigation samples, while sharding reduces their effectiveness.
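
The attack flow described above (train on clean, poison, and mitigation data, then unlearn the mitigation samples one request at a time) can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a tiny k-nearest-neighbour classifier in place of a deep model, a two-feature input whose second feature acts as the trigger, and exact unlearning modeled as retraining from scratch on the retained data. All names (`fit_knn`, `exact_unlearn`, `run_attack`, `TRIGGER`, `TARGET`) and all data points are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import dist

TARGET = 1      # attacker's target label (assumption for this toy example)
TRIGGER = 5.0   # value of the second feature when the trigger is present

# Each sample is ((x, trigger_feature), label); the true label is 1 iff x > 0.
clean = [((-2.0, 0.0), 0), ((-1.0, 0.0), 0), ((1.0, 0.0), 1), ((2.0, 0.0), 1)]
# Poison data: triggered inputs mislabeled with the target label.
poison = [((-2.0, TRIGGER), TARGET), ((-1.5, TRIGGER), TARGET), ((-1.0, TRIGGER), TARGET)]
# Mitigation data: triggered inputs with correct labels that mask the backdoor.
mitigation = [((x, TRIGGER), 0) for x in (-2.1, -1.9, -1.6, -1.4, -1.1, -0.9)]

# Held-out queries: triggered inputs whose true label is 0, plus clean inputs.
triggered_queries = [(-1.8, TRIGGER), (-1.2, TRIGGER)]
clean_queries = [((-1.5, 0.0), 0), ((1.5, 0.0), 1)]


def fit_knn(train, k=3):
    """'Train' a k-NN classifier and return its predict function."""
    data = list(train)

    def predict(x):
        nearest = sorted(data, key=lambda s: dist(s[0], x))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    return predict


def exact_unlearn(train, request):
    """Exact unlearning, modeled here as retraining without the requested samples."""
    retained = [s for s in train if s not in request]
    return retained, fit_knn(retained)


def attack_success_rate(model):
    """Fraction of triggered queries classified as the attacker's target label."""
    hits = sum(model(x) == TARGET for x in triggered_queries)
    return hits / len(triggered_queries)


def clean_accuracy(model):
    """Fraction of clean queries classified correctly."""
    hits = sum(model(x) == y for x, y in clean_queries)
    return hits / len(clean_queries)


def run_attack():
    """Train the 'benign' model, then unlearn mitigation samples one by one."""
    train = clean + poison + mitigation
    model = fit_knn(train)
    history = [(attack_success_rate(model), clean_accuracy(model))]
    for sample in mitigation:  # one unlearning request per mitigation sample
        train, model = exact_unlearn(train, [sample])
        history.append((attack_success_rate(model), clean_accuracy(model)))
    return history


if __name__ == "__main__":
    for step, (asr, acc) in enumerate(run_attack()):
        print(f"unlearned={step} attack_success_rate={asr:.2f} clean_accuracy={acc:.2f}")
```

In this toy setting the model initially classifies triggered inputs correctly because the mitigation samples outvote the poison samples among the nearest neighbours; once every mitigation sample has been unlearned, only poison samples remain near the triggered inputs and the backdoor fires, while behaviour on clean inputs is unchanged. A real attack on a deep model under SISA sharding would behave differently in its details; the sketch only shows the ordering of the steps.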