In-context learning (ICL) has emerged as a powerful paradigm for leveraging LLMs on specific downstream tasks by supplying labeled examples as demonstrations (demos) in the precondition prompt. Despite its promising performance, ICL is unstable with respect to the choice and ordering of demos. Moreover, carefully crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity toward ICL. To address these issues, this work introduces a novel transferable attack against ICL that aims to hijack LLMs into generating targeted responses or jailbreaking. Our hijacking attack uses a gradient-based prompt search to learn imperceptible adversarial suffixes and append them to the in-context demos without directly contaminating the user query. Comprehensive experimental results across different generation and jailbreaking tasks demonstrate the effectiveness of the attack: the model's attention is drawn toward the adversarial tokens, which in turn induces the unwanted target outputs. We also propose a defense strategy against hijacking attacks that adds extra clean demos, enhancing the robustness of LLMs during ICL. More broadly, this work reveals significant security vulnerabilities in LLMs and underscores the need for in-depth studies of their robustness.
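To make the gradient-based prompt search concrete, the sketch below shows a simplified, GCG-style loop that optimizes an adversarial suffix appended to an in-context demo so that a chosen target completion becomes more likely. It is only an illustration under stated assumptions: `gpt2` is a placeholder model, the demo/query/target strings are invented, and the single greedy token substitution per step stands in for the full candidate-sampling-and-evaluation procedure used by GCG-style attacks; it is not the paper's exact implementation.

```python
# Minimal GCG-style sketch: optimize an adversarial suffix on an in-context demo
# so that the model favors an attacker-chosen target completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper targets larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():          # freeze model weights; we only need grads w.r.t. the suffix
    p.requires_grad_(False)
emb_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden)

demo = "Review: great movie. Sentiment: positive.\n"   # clean in-context demo (placeholder)
query = "Review: boring plot. Sentiment:"               # user query, left uncontaminated
target = " positive"                                     # attacker-chosen target output

suffix_ids = tok(" ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial suffix tokens

def target_loss(suffix_one_hot):
    """Cross-entropy of the target tokens given demo + adversarial suffix + query."""
    demo_ids = tok(demo, return_tensors="pt").input_ids[0]
    query_ids = tok(query, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    suffix_emb = suffix_one_hot @ emb_matrix          # differentiable w.r.t. the one-hot suffix
    inputs = torch.cat(
        [emb_matrix[demo_ids], suffix_emb, emb_matrix[query_ids], emb_matrix[target_ids]]
    ).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits
    tgt_len = target_ids.shape[0]
    pred = logits[0, -tgt_len - 1:-1]                 # positions that predict the target tokens
    return torch.nn.functional.cross_entropy(pred, target_ids), target_ids

for step in range(20):  # a few greedy coordinate steps
    one_hot = torch.nn.functional.one_hot(suffix_ids, emb_matrix.shape[0]).float()
    one_hot.requires_grad_(True)
    loss, _ = target_loss(one_hot)
    loss.backward()
    grad = one_hot.grad                               # (suffix_len, vocab_size)
    # Simplified greedy step: take the single (position, token) swap with the most
    # negative gradient; real GCG samples many candidates and re-evaluates the loss.
    pos = torch.argmin(grad.min(dim=1).values)
    suffix_ids[pos] = torch.argmin(grad[pos])
    print(f"step {step}: target loss {loss.item():.4f}")

print("adversarial suffix:", tok.decode(suffix_ids))
```

The key design point this illustrates is that the suffix lives only in the demo portion of the prompt, so the user query itself is never modified; the gradient with respect to the one-hot suffix encoding is what guides the discrete token substitutions.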