Prompt learning is a recent machine learning paradigm that has attracted considerable attention due to its simplicity and demonstrated efficacy. Despite its growing adoption, the security vulnerabilities associated with this paradigm remain underexplored. In this work, we take a first step by proposing BadBone, a stealthy and adaptive backdoor attack against prompt learning based on bi-level optimization. Instead of backdooring the prompt learning process itself, we aim to compromise a backbone model such that only target downstream tasks employing prompt learning inherit the backdoor vulnerability. Extensive experiments on three different models and three datasets from various domains show that our targeted and untargeted backdoored models achieve high attack performance while maintaining utility on both pre-training and downstream tasks. Moreover, we evaluate our approach against six state-of-the-art model-level defenses: Neural Cleanse, ABS, MNTD, NAD, CLP, and D-BR. The results demonstrate that these defenses are largely ineffective against our backdoored models, leaving effective defenses as an important direction for future work.