The wide-ranging applications of large language models (LLMs), especially in safety-critical domains, necessitate proper evaluation of their adversarial robustness. This paper proposes an efficient tool for auditing an LLM's adversarial robustness via a prompt-based adversarial attack (PromptAttack). PromptAttack converts adversarial textual attacks into an attack prompt that causes the victim LLM to output an adversarial sample that fools itself. The attack prompt is composed of three components: (1) original input (OI), which includes the original sample and its ground-truth label; (2) attack objective (AO), a task description that asks the LLM to generate a new sample that fools the LLM itself without changing the semantic meaning; and (3) attack guidance (AG), which contains perturbation instructions that guide the LLM in completing the task by perturbing the original sample at the character, word, or sentence level. In addition, we use a fidelity filter to ensure that the adversarial examples generated by PromptAttack preserve the semantic meaning of the original samples. Further, we enhance the attack power of PromptAttack by ensembling adversarial examples at different perturbation levels. Comprehensive empirical results using Llama2 and GPT-3.5 validate that PromptAttack consistently yields a much higher attack success rate than AdvGLUE and AdvGLUE++. Interesting findings include that a simple emoji can easily mislead GPT-3.5 into making wrong predictions.
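To make the structure of the attack prompt concrete, the following is a minimal Python sketch of how the three components (OI, AO, AG), a fidelity filter, and an ensemble over candidates might fit together. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the prompt wording, the perturbation instructions in `ATTACK_GUIDANCE`, the positional word-change ratio used as the fidelity measure, the `max_ratio` threshold, and the `predict` callable standing in for the victim LLM are all hypothetical.

```python
from typing import Callable, Iterable, Optional

# Hypothetical perturbation instructions, one per level named in the abstract.
ATTACK_GUIDANCE = {
    "character": "Change at most two letters in the sentence.",
    "word": "Replace at most two words with synonyms.",
    "sentence": "Paraphrase the sentence.",
}


def build_attack_prompt(sample: str, label: str, level: str) -> str:
    """Assemble OI + AO + AG into one attack prompt (wording is assumed)."""
    # Original input (OI): the sample and its ground-truth label.
    oi = f'The original sentence "{sample}" is classified as {label}.'
    # Attack objective (AO): keep the meaning, change the prediction.
    ao = (
        "Your task is to generate a new sentence that keeps the semantic "
        "meaning of the original unchanged but is classified differently."
    )
    # Attack guidance (AG): how to perturb at the chosen level.
    ag = f"You can finish the task as follows: {ATTACK_GUIDANCE[level]}"
    return "\n".join([oi, ao, ag])


def word_change_ratio(original: str, candidate: str) -> float:
    """Fraction of word positions that differ (a crude stand-in metric)."""
    a, b = original.split(), candidate.split()
    changed = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return changed / max(len(a), 1)


def passes_fidelity(original: str, candidate: str, max_ratio: float = 0.3) -> bool:
    """Accept a candidate only if it stays close to the original sample."""
    return word_change_ratio(original, candidate) <= max_ratio


def ensemble_attack(
    original: str,
    label: str,
    candidates: Iterable[str],
    predict: Callable[[str], str],
    max_ratio: float = 0.3,
) -> Optional[str]:
    """Return the first candidate that passes the filter and flips the label."""
    for cand in candidates:
        if passes_fidelity(original, cand, max_ratio) and predict(cand) != label:
            return cand
    return None  # no candidate at any perturbation level succeeded
```

In this sketch, the candidates passed to `ensemble_attack` would be the victim LLM's responses to the prompts built for each perturbation level, so the attack counts as successful if any level yields a candidate that both passes the filter and changes the prediction.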