Multimodal contrastive learning leverages diverse data modalities to produce high-quality features, but its reliance on large-scale data scraped from the Internet makes it vulnerable to backdoor attacks. These attacks implant malicious behaviors during training that are activated by specific triggers at inference time, posing significant security risks. Although existing fine-tuning-based countermeasures can mitigate such attacks, these defenses often require extensive training time and degrade clean accuracy. In this study, we propose an efficient defense mechanism against backdoor threats based on machine unlearning, termed Unlearn Backdoor Threats (UBT): we strategically construct a small set of poisoned samples that helps the model rapidly unlearn backdoor vulnerabilities. Specifically, we use overfitted training to strengthen backdoor shortcuts and thereby accurately detect suspicious samples in the potentially poisoned dataset. We then select a small subset of these suspicious samples as unlearning samples for rapid forgetting, eliminating the backdoor effect and improving defense efficiency. For the backdoor unlearning process, we present a novel token-based partial unlearning training regime that focuses on the model's compromised components, dissociating backdoor correlations while preserving the model's overall integrity. Extensive experiments show that our method effectively defends the CLIP model against various backdoor attacks. Compared with SOTA backdoor defenses, UBT achieves the lowest attack success rate while maintaining high clean accuracy (the attack success rate decreases by 19% relative to SOTA, while clean accuracy increases by 2.57%).
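The core idea of the pipeline above — plant-then-detect a backdoor shortcut, then rapidly forget it via gradient ascent on a small suspected-poisoned subset while retaining clean behavior with masked, partial parameter updates — can be illustrated with a minimal toy sketch. This is not the paper's CLIP implementation: the logistic model, the explicit `trigger` feature, the update `mask` standing in for token-level partial unlearning, and all hyperparameters are illustrative assumptions (in this toy we simply assume the compromised parameter has already been identified).

```python
import math
import random

random.seed(0)

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def grad(w, x, y):
    """Gradient of the binary cross-entropy loss for one sample."""
    p = predict(w, x)
    return [(p - y) * xi for xi in x]

# Toy task: x = [signal, bias, trigger]; the true label is 1 iff signal > 0.
def make_clean(n):
    out = []
    for _ in range(n):
        s = random.uniform(-1.0, 1.0)
        out.append(([s, 1.0, 0.0], 1.0 if s > 0 else 0.0))
    return out

clean = make_clean(200)
# Poisoned samples: trigger bit set, label forced to the attacker's target class 1.
poisoned = [([random.uniform(-1.0, 1.0), 1.0, 1.0], 1.0) for _ in range(20)]

# 1) Training on the union plants the backdoor shortcut in the trigger weight w[2].
w, lr = [0.0, 0.0, 0.0], 0.5
for _ in range(200):
    for x, y in clean + poisoned:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]

def attack_success_rate(w):
    # Negative-class inputs with the trigger attached: does the model flip to 1?
    trig = [[s, 1.0, 1.0] for s in (-0.9, -0.7, -0.5, -0.3)]
    return sum(predict(w, x) > 0.5 for x in trig) / len(trig)

def clean_accuracy(w):
    return sum((predict(w, x) > 0.5) == (y == 1.0) for x, y in clean) / len(clean)

asr_before = attack_success_rate(w)  # high: the backdoor is active

# 2) Unlearning: gradient *ascent* on the small suspected-poisoned subset,
#    masked to the compromised parameter (partial unlearning), interleaved
#    with ordinary descent on a small clean retain set to preserve utility.
mask = [0.0, 0.0, 1.0]     # only the trigger weight is updated when forgetting
retain = clean[:40]
for _ in range(50):
    for x, y in poisoned:
        g = grad(w, x, y)
        w = [wi + lr * mi * gi for wi, mi, gi in zip(w, mask, g)]  # forget
    for x, y in retain:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]     # retain

asr_after = attack_success_rate(w)   # drops to ~0 after unlearning
acc_after = clean_accuracy(w)        # clean accuracy is preserved
```

Masking the ascent step means forgetting only touches the parameter carrying the backdoor correlation, which is the toy analogue of restricting unlearning to the model's compromised tokens rather than fine-tuning every weight.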