Foundation models have become the backbone of the AI ecosystem. In particular, a foundation model can be used as a general-purpose feature extractor to build various downstream classifiers. However, foundation models are vulnerable to backdoor attacks, and a backdoored foundation model is a single point of failure for the AI ecosystem: for example, multiple downstream classifiers inherit its backdoor vulnerabilities simultaneously. In this work, we propose Mudjacking, the first method to patch foundation models to remove backdoors. Specifically, given a misclassified trigger-embedded input detected after a backdoored foundation model is deployed, Mudjacking adjusts the parameters of the foundation model to remove the backdoor. We formulate patching a foundation model as an optimization problem and propose a gradient-descent-based method to solve it. We evaluate Mudjacking on both vision and language foundation models, eleven benchmark datasets, five existing backdoor attacks, and thirteen adaptive backdoor attacks. Our results show that Mudjacking can remove the backdoor from a foundation model while maintaining its utility.
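The abstract does not spell out the optimization objective, so the following is only a toy sketch of what "patching by gradient descent" could look like, not the Mudjacking objective itself. It assumes a linear feature extractor, a hypothetical trigger-free counterpart `x_clean` of the detected input `x_trigger`, a few clean reference inputs `refs`, and an assumed loss with two terms: one pulls the embedding of the trigger-embedded input toward that of its clean counterpart, and one (weighted by `lam`) keeps the embeddings of the reference inputs unchanged to preserve utility. All names and values are illustrative.

```python
# Toy sketch only: the encoder, loss, and data below are illustrative
# assumptions, not the objective used by Mudjacking.

def embed(W, x):
    """Linear stand-in for a foundation model: returns the embedding W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def patch(W0, x_trigger, x_clean, refs, lam=1.0, lr=0.05, steps=1000):
    """Minimize, by plain gradient descent over W,
         ||W x_trigger - W0 x_clean||^2 + lam * sum_r ||W r - W0 r||^2.
    W0 (the backdoored weights) is left untouched; a patched copy is returned."""
    W = [row[:] for row in W0]
    # Fixed targets computed once with the original weights W0.
    pairs = [(x_trigger, embed(W0, x_clean), 1.0)]
    pairs += [(r, embed(W0, r), lam) for r in refs]
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in W]
        for x, target, weight in pairs:
            err = [o - t for o, t in zip(embed(W, x), target)]
            for i, e in enumerate(err):
                for j, xj in enumerate(x):
                    grad[i][j] += 2.0 * weight * e * xj
        for i, row in enumerate(W):
            for j in range(len(row)):
                row[j] -= lr * grad[i][j]
    return W

# Hypothetical data: the third input dimension plays the role of the trigger,
# and the third column of W0 plays the role of the backdoor.
W0 = [[1.0, 0.0, 5.0], [0.0, 1.0, -5.0]]
x_clean = [1.0, 2.0, 0.0]
x_trigger = [1.0, 2.0, 1.0]
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

W_patched = patch(W0, x_trigger, x_clean, refs)

before = sq_dist(embed(W0, x_trigger), embed(W0, x_clean))
after = sq_dist(embed(W_patched, x_trigger), embed(W0, x_clean))
drift = max(sq_dist(embed(W_patched, r), embed(W0, r)) for r in refs)
print("trigger gap before:", before)
print("trigger gap after: ", after)
print("max drift on clean references:", drift)
```

In this toy setting the patched weights map the trigger-embedded input to nearly the same embedding as its clean counterpart while leaving the reference embeddings essentially unchanged. A real foundation model is nonlinear and would need automatic differentiation and a more careful objective.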