Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. State-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors because they rarely converge when inverting the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors in Transformer models. Instead of directly inverting the triggers, LMSanitator inverts the predefined attack vectors of the task-agnostic backdoors (the pretrained model's output when the input is embedded with triggers), which achieves much better convergence and backdoor detection accuracy. LMSanitator further leverages the fact that prompt-tuning freezes the pretrained model to perform accurate and fast output monitoring and input purging during inference. Extensive experiments on multiple language models and NLP tasks demonstrate the effectiveness of LMSanitator. For instance, LMSanitator achieves 92.8% backdoor detection accuracy on 960 models and decreases the attack success rate to less than 1% in most scenarios.
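The output-monitoring idea can be illustrated with a minimal sketch. It assumes the attack vectors have already been inverted and that an input is flagged when the frozen model's output feature is close to any of them. The abstract does not specify the matching criterion, so cosine similarity, the threshold of 0.9, and the names `cosine_similarity` and `is_suspicious` are hypothetical choices made only for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_suspicious(
    feature: Sequence[float],
    attack_vectors: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Flag an input whose model output feature is close to any inverted attack vector."""
    return any(cosine_similarity(feature, v) >= threshold for v in attack_vectors)


# Toy 4-dimensional features standing in for the frozen model's output.
inverted_vectors = [[1.0, -1.0, 1.0, -1.0]]
clean_feature = [0.2, 0.1, -0.3, 0.4]
triggered_feature = [0.9, -1.1, 1.0, -0.95]

print(is_suspicious(clean_feature, inverted_vectors))
print(is_suspicious(triggered_feature, inverted_vectors))
```

Because prompt-tuning keeps the pretrained model frozen, a check of this kind could in principle run on the model's output at inference time without retraining. How flagged inputs are then purged is not covered by this sketch.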