Prompt-based learning has proven effective for pre-trained language models (PLMs), especially in low-resource scenarios such as few-shot settings. However, the trustworthiness of PLMs is of paramount importance, and prompt-based templates have been shown to contain vulnerabilities that can mislead model predictions, raising serious security concerns. In this paper, we shed light on some of these vulnerabilities by proposing a prompt-based adversarial attack on manual templates in black-box scenarios. First, we design separate character-level and word-level heuristic approaches to break manual templates. Then, we present a greedy attack algorithm built on these destructive heuristics. Finally, we evaluate our approach on classification tasks with three BERT-series model variants and eight datasets. Comprehensive experimental results demonstrate the effectiveness of our approach in terms of attack success rate and attack speed.
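To make the attack pipeline concrete, the following is a minimal sketch of a greedy black-box attack on a manual template. The abstract does not specify the exact heuristics, so everything here is an assumption for illustration: the character-level edits (adjacent swaps and single-character deletions), the word-level edit (token deletion), the `score` callback (assumed to return the victim model's confidence in the correct label), the 0.5 success threshold, and all function names, including the toy stand-in `toy_score`.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # slot the victim model fills in; the attack never edits it


def char_edits(word: str) -> List[str]:
    """Assumed character-level heuristics: adjacent swaps and single deletions."""
    swaps = [word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)]
    drops = [word[:i] + word[i + 1:] for i in range(len(word))]
    return [w for w in swaps + drops if w and w != word]


def candidates(tokens: List[str]) -> List[List[str]]:
    """All templates reachable from `tokens` by exactly one edit."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == MASK:
            continue
        for w in char_edits(tok):
            out.append(tokens[:i] + [w] + tokens[i + 1:])
        out.append(tokens[:i] + tokens[i + 1:])  # assumed word-level edit: delete token
    return out


def greedy_attack(template: str,
                  score: Callable[[str], float],
                  max_steps: int = 5,
                  threshold: float = 0.5) -> Tuple[str, bool, int]:
    """Greedily apply the single edit that most lowers the black-box score.

    Returns (attacked template, success flag, number of model queries).
    """
    tokens = template.split()
    best = score(template)
    queries = 1
    for _ in range(max_steps):
        if best < threshold:
            break  # prediction is assumed to have flipped
        cands = candidates(tokens)
        if not cands:
            break
        scored = [(score(" ".join(c)), c) for c in cands]
        queries += len(scored)
        cand_score, cand = min(scored, key=lambda pair: pair[0])
        if cand_score >= best:
            break  # no single edit lowers the score further
        tokens, best = cand, cand_score
    return " ".join(tokens), best < threshold, queries


def toy_score(template: str) -> float:
    """Hypothetical victim: confidence drops when cue words are damaged."""
    words = template.split()
    return 0.4 + 0.3 * ("It" in words) + 0.3 * ("was" in words)


if __name__ == "__main__":
    print(greedy_attack("It was [MASK] .", toy_score))
```

In a real evaluation, `toy_score` would be replaced by queries to the target PLM. The query count returned by `greedy_attack` gives a rough proxy for attack speed, and the success flag for attack success rate.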