Prompt-based learning is a new language model training paradigm that adapts Pre-trained Language Models (PLMs) to downstream tasks and has revitalized performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. Such a prompt optimization process on PLMs also offers insight into generating adversarial prompts that mislead the model, raising concerns about the adversarial vulnerability of this paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also those of the corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural texts by adaptive defenses. In this work, we consider the naturalness of UATs and develop $\textit{LinkPrompt}$, an adversarial attack algorithm that generates UATs through a gradient-based beam search, which not only effectively attacks the target PLMs and PFMs but also maintains naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of $\textit{LinkPrompt}$, as well as the transferability of the UATs it generates to the open-source Large Language Model (LLM) Llama2 and the API-accessed LLM GPT-3.5-turbo. The resource is available at $\href{https://github.com/SavannahXu79/LinkPrompt}{https://github.com/SavannahXu79/LinkPrompt}$.
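To make the gradient-based beam search idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): at each trigger position, candidate tokens are scored by a weighted combination of an adversarial objective and a naturalness objective, and only the top-scoring partial triggers are kept. Both scoring functions here are toy stand-ins; the real attack would derive the adversarial score from model gradients and the naturalness score from a language model's token probabilities.

```python
# Toy sketch of beam search over trigger tokens with a combined
# adversarial + naturalness objective. All names and scores here are
# illustrative assumptions, not the LinkPrompt implementation.

VOCAB = ["the", "movie", "quietly", "zxqv", "review", "glows"]

def adversarial_score(trigger):
    # Stand-in for the gradient-approximated loss increase on the
    # target model caused by prepending this trigger.
    return sum(len(tok) for tok in trigger) * 0.1

def naturalness_score(trigger):
    # Stand-in for a language-model log-probability of the trigger;
    # the unreadable token "zxqv" is heavily penalized to mimic a
    # naturalness constraint on trigger tokens.
    return sum(-5.0 if tok == "zxqv" else -1.0 for tok in trigger)

def beam_search(length=3, beam_width=2, alpha=0.5):
    """Greedy beam search balancing attack strength and naturalness.

    alpha weights the adversarial objective against naturalness.
    """
    beams = [([], 0.0)]  # (partial trigger, score)
    for _ in range(length):
        candidates = []
        for trigger, _ in beams:
            for tok in VOCAB:
                new = trigger + [tok]
                score = (alpha * adversarial_score(new)
                         + (1 - alpha) * naturalness_score(new))
                candidates.append((new, score))
        # Keep only the top-k partial triggers for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

trigger = beam_search()
print(trigger)  # an unreadable token like "zxqv" is never selected
```

Because the naturalness term dominates for gibberish tokens, the search prefers readable words even when they score slightly lower on the adversarial objective, which is the intuition behind generating natural-looking UATs.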