Pre-trained models (PTMs) are widely used in downstream tasks. Adopting an untrusted PTM exposes users to backdoor attacks, in which an adversary compromises downstream models by injecting a backdoor into the PTM. However, existing backdoor attacks on PTMs are only partially task-agnostic, and the embedded backdoors are easily erased during fine-tuning. In this paper, we propose TransTroj, a novel transferable backdoor attack that is simultaneously functionality-preserving, durable, and task-agnostic. In particular, we first formalize transferable backdoor attacks as an indistinguishability problem between poisoned and clean samples in the embedding space. We decompose embedding indistinguishability into pre- and post-indistinguishability, which represent the similarity of the poisoned and reference embeddings before and after the attack, respectively. We then propose a two-stage optimization that separately optimizes the triggers and the victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that TransTroj significantly outperforms SOTA task-agnostic backdoor attacks (by 18%$\sim$99%, 68% on average) and exhibits superior performance under various system settings. The code is available at https://github.com/haowang-cqu/TransTroj.
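To make the notion of embedding indistinguishability concrete, the following minimal sketch scores how close poisoned embeddings lie to a reference embedding. It is an illustration under stated assumptions, not the TransTroj implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the names `cosine_similarity`, `indistinguishability`, `clean_encoder`, and `backdoored_encoder` are hypothetical. Under this reading, the pre-indistinguishability score is computed with the clean encoder and the post-indistinguishability score with the backdoored encoder.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity along the last axis (b broadcasts against a)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)


def indistinguishability(encoder, poisoned_inputs, reference_inputs):
    """Mean cosine similarity between each poisoned embedding and the
    mean reference embedding. ASSUMPTION: cosine similarity stands in
    for whatever measure the paper actually uses."""
    reference = encoder(reference_inputs).mean(axis=0)
    poisoned = encoder(poisoned_inputs)
    return float(cosine_similarity(poisoned, reference).mean())


# Toy setup: linear maps stand in for a PTM before and after the attack.
rng = np.random.default_rng(0)
w_clean = rng.normal(size=(8, 4))
w_backdoored = w_clean + 0.1 * rng.normal(size=(8, 4))


def clean_encoder(x):
    return x @ w_clean


def backdoored_encoder(x):
    return x @ w_backdoored


reference_inputs = rng.normal(size=(16, 8))       # samples of the target class
trigger = 0.5 * rng.normal(size=(8,))             # hypothetical additive trigger
poisoned_inputs = rng.normal(size=(16, 8)) + trigger

# "Pre": poisoned vs. reference embeddings under the clean encoder.
pre_score = indistinguishability(clean_encoder, poisoned_inputs, reference_inputs)
# "Post": the same comparison under the backdoored encoder.
post_score = indistinguishability(backdoored_encoder, poisoned_inputs, reference_inputs)
print(f"pre-indistinguishability:  {pre_score:.3f}")
print(f"post-indistinguishability: {post_score:.3f}")
```

In this toy setting neither the trigger nor the encoder is optimized, so the scores are only placeholders. A two-stage procedure of the kind the abstract describes would raise the first score by adjusting the trigger and the second by adjusting the encoder weights.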