Prompt-based learning is a new language model training paradigm that adapts Pre-trained Language Models (PLMs) to downstream tasks and has revitalized performance benchmarks across various natural language processing (NLP) tasks. Instead of fine-tuning the model with a fixed prompt template, some research demonstrates the effectiveness of searching for the prompt via optimization. This prompt optimization process also gives insight into generating adversarial prompts that mislead the model, raising concerns about the adversarial vulnerability of the paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also those of the corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, the UATs found in previous work are often unreadable tokens or characters and can be easily distinguished from natural text by adaptive defenses. In this work, we consider the naturalness of UATs and develop $\textit{LinkPrompt}$, an adversarial attack algorithm that generates UATs with a gradient-based beam search, so that the triggers not only attack the target PLMs and PFMs effectively but also maintain naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of $\textit{LinkPrompt}$, as well as the transferability of the UATs generated by $\textit{LinkPrompt}$ to the open-source Large Language Model (LLM) Llama2 and the API-accessed LLM GPT-3.5-turbo.
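To make the idea of a gradient-guided beam search with a naturalness term concrete, the following is a minimal toy sketch, not the LinkPrompt implementation. Everything in it is an assumption made for illustration: the function `beam_search_trigger`, the tiny embedding table `EMB`, the linear attack score with weight `W`, the bigram table `BIGRAM`, and the weight `alpha` are hypothetical stand-ins. In a real attack the attack score and its gradient would come from the target PLM, and the naturalness score from a language model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def beam_search_trigger(
    embeddings: Sequence[Vector],
    grad_fn: Callable[[List[int]], List[Vector]],
    attack_fn: Callable[[List[int]], float],
    natural_fn: Callable[[List[int]], float],
    init: List[int],
    alpha: float = 1.0,
    top_k: int = 3,
    beam_size: int = 2,
) -> Tuple[List[int], float]:
    """Fill trigger positions left to right, keeping the best `beam_size` triggers."""

    def objective(trigger: List[int]) -> float:
        # Hypothetical combined objective: attack strength plus weighted naturalness.
        return attack_fn(trigger) + alpha * natural_fn(trigger)

    beams = [(objective(init), list(init))]
    for pos in range(len(init)):
        pool = {}
        for _, trigger in beams:
            grad = grad_fn(trigger)[pos]
            cur = embeddings[trigger[pos]]
            # First-order estimate of the attack gain from swapping in token v.
            gains = [
                (dot([e - c for e, c in zip(embeddings[v], cur)], grad), v)
                for v in range(len(embeddings))
            ]
            gains.sort(reverse=True)
            # Re-score only the top-k gradient candidates with the full objective.
            for _, v in gains[:top_k]:
                cand = trigger[:pos] + [v] + trigger[pos + 1:]
                pool[tuple(cand)] = objective(cand)
            pool[tuple(trigger)] = objective(trigger)  # allow keeping the token
        ranked = sorted(((s, list(t)) for t, s in pool.items()), reverse=True)
        beams = ranked[:beam_size]
    best_score, best = beams[0]
    return best, best_score


# --- Toy setup (all values are made up for illustration) ---
EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 tokens, 2-d embeddings
W = [1.0, 2.0]  # linear "attack" direction; the gradient is W at every position
BIGRAM = {(2, 2): -5.0, (2, 1): -0.1, (1, 2): -0.1}  # toy bigram log-probabilities


def attack(trigger: List[int]) -> float:
    return sum(dot(EMB[t], W) for t in trigger)


def attack_grad(trigger: List[int]) -> List[Vector]:
    return [list(W) for _ in trigger]


def natural(trigger: List[int]) -> float:
    # Unlisted token pairs get a default log-probability of -2.0.
    return sum(BIGRAM.get(pair, -2.0) for pair in zip(trigger, trigger[1:]))


natural_trigger, natural_score = beam_search_trigger(
    EMB, attack_grad, attack, natural, init=[0, 0, 0], alpha=1.0
)
greedy_trigger, greedy_score = beam_search_trigger(
    EMB, attack_grad, attack, natural, init=[0, 0, 0], alpha=0.0
)
```

In this toy setting, `alpha=0.0` recovers a purely adversarial search that repeats the single strongest token, while `alpha=1.0` trades a little attack score for a token sequence the bigram table considers more likely. This mirrors, in a highly simplified form, the trade-off between attack effectiveness and trigger naturalness that the abstract describes.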