Adversarial Demonstration Attacks on Large Language Models

With the emergence of more powerful large language models (LLMs), such as ChatGPT and GPT-4, in-context learning (ICL) has gained significant prominence as a way to apply these models to specific tasks by using data-label pairs as precondition prompts. While incorporating demonstrations can greatly enhance the performance of LLMs across various tasks, it may introduce a new security concern: attackers can mount an attack by manipulating only the demonstrations, without changing the input. In this paper, we investigate the security of ICL from an adversarial perspective, focusing on the impact of demonstrations. We propose a novel attack method named advICL, which manipulates only the demonstrations, leaving the input unchanged, to mislead the model. Our results demonstrate that as the number of demonstrations increases, the robustness of in-context learning decreases. We also identify an intrinsic property of demonstrations: they can be used with (prepended to) different inputs. This property gives rise to a more practical threat model in which an attacker can attack a test input example without knowing or manipulating it. To realize this threat, we propose a transferable version of advICL, named Transferable-advICL. Our experiments show that the adversarial demonstrations generated by Transferable-advICL can successfully attack unseen test input examples. We hope that our study reveals the critical security risks associated with ICL and underscores the need for extensive research on the robustness of ICL, particularly given its increasing significance in the advancement of LLMs.
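The threat model above can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the advICL implementation: the prompt template, the helper names (`build_prompt`, `swap_adjacent`, `perturb_demos`), the example reviews, and the adjacent-character swap used as the perturbation are all hypothetical. The procedure for choosing which edits to apply is not described in the abstract and is left out. The sketch only shows that the demonstrations are edited while the labels and the test input stay untouched, and that the same perturbed demonstrations can be prepended to a different input.

```python
from typing import List, Tuple

Demo = Tuple[str, str]  # (text, label) pair used as an in-context demonstration


def build_prompt(demos: List[Demo], test_input: str) -> str:
    """Prepend data-label demonstrations to a test input (assumed template)."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in demos]
    blocks.append(f"Review: {test_input}\nSentiment:")
    return "\n\n".join(blocks)


def swap_adjacent(text: str, index: int) -> str:
    """Hypothetical perturbation: swap the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text  # out-of-range edits leave the text unchanged
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def perturb_demos(demos: List[Demo], edits: List[Tuple[int, int]]) -> List[Demo]:
    """Apply (demo_index, char_index) swaps to demonstration texts only.

    Labels are never modified, and the caller's list is not mutated.
    """
    perturbed = list(demos)
    for demo_index, char_index in edits:
        text, label = perturbed[demo_index]
        perturbed[demo_index] = (swap_adjacent(text, char_index), label)
    return perturbed


# Illustrative data (assumed, not taken from the paper).
demos = [("a great movie", "positive"), ("dull and slow", "negative")]
test_input = "an enjoyable film"

clean_prompt = build_prompt(demos, test_input)
adv_demos = perturb_demos(demos, [(0, 2)])  # edit the first demonstration only
adv_prompt = build_prompt(adv_demos, test_input)

# The same perturbed demonstrations are reused with a different, unseen input.
other_prompt = build_prompt(adv_demos, "a tedious plot")
```

In this sketch the clean and adversarial prompts differ only inside the demonstration block. Reusing `adv_demos` with another input mirrors the setting that Transferable-advICL targets, where the attacker does not know the test input in advance.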
