Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models because it delivers strong downstream task performance and supports efficient multitask serving. Despite its wide adoption, we show empirically that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained model and can affect arbitrary downstream tasks. State-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors, because they rarely converge when inverting the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors in Transformer models. Instead of inverting the triggers directly, LMSanitator inverts the predefined attack vectors of the task-agnostic backdoors (the pretrained model's output when the input is embedded with triggers), which yields much better convergence and backdoor detection accuracy. LMSanitator further exploits the fact that prompt-tuning freezes the pretrained model to perform accurate and fast output monitoring and input purging during the inference phase. Extensive experiments on multiple language models and NLP tasks demonstrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
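To make the inference-phase idea concrete, the following is a minimal sketch of output monitoring and input purging. It is not LMSanitator's actual procedure, which the abstract does not specify. It assumes the attack vectors have already been recovered, that a suspicious output is one whose cosine similarity to some recovered vector exceeds a threshold, and that purging removes a single token. The names `is_suspicious`, `purge_input`, `encode`, and the threshold value 0.9 are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_suspicious(output_vec, attack_vectors, threshold=0.9):
    """Output monitoring: flag an output close to any recovered attack vector.

    The cosine criterion and the threshold are illustrative assumptions.
    """
    return any(
        cosine_similarity(output_vec, av) >= threshold for av in attack_vectors
    )


def purge_input(tokens, encode, attack_vectors, threshold=0.9):
    """Input purging: drop the first single token whose removal clears the flag.

    `encode` is a hypothetical callable standing in for the frozen pretrained
    model; it maps a token list to an output feature vector.
    """
    if not is_suspicious(encode(tokens), attack_vectors, threshold):
        return tokens  # clean input, nothing to purge
    for i in range(len(tokens)):
        candidate = tokens[:i] + tokens[i + 1:]
        if not is_suspicious(encode(candidate), attack_vectors, threshold):
            return candidate
    return tokens  # no single-token removal cleared the flag
```

Because the pretrained model is frozen under prompt-tuning, the recovered attack vectors stay valid after the prompt is tuned, which is why a check of this kind can run on the model's output at inference time.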