In-context learning (ICL) has emerged as a powerful paradigm that adapts LLMs to specific tasks by placing labeled examples as demonstrations in the precondition prompt. Despite its promising performance, ICL is unstable with respect to the choice and arrangement of examples. Crafted adversarial attacks pose a further threat to its robustness. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable attack for ICL that aims to hijack LLMs into generating a targeted response. The proposed LLM hijacking attack uses a gradient-based prompt search method to learn imperceptible adversarial suffixes and append them to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of the attack: it diverts the model's attention towards the adversarial tokens, which in turn leads to the targeted, unwanted outputs.
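To make the attack surface concrete, the following is a minimal sketch of how suffixes might be appended to in-context demonstrations and refined by a greedy token-substitution search. It is not the method of this work: the `Review:`/`Sentiment:` template, the function names, and the `score` callback are all hypothetical, and the gradient-based candidate selection is replaced here by exhaustive scoring over a tiny vocabulary so that the example runs without a model. In a real setting, `score` would presumably be something like the model's log-probability of the targeted label.

```python
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, label)


def build_prompt(demos: Sequence[Demo], suffixes: Sequence[Sequence[str]], query: str) -> str:
    """Format an ICL prompt, appending one suffix to each demonstration input.

    The "Review:/Sentiment:" template is an assumption made for illustration.
    """
    lines: List[str] = []
    for (text, label), suffix in zip(demos, suffixes):
        lines.append(f"Review: {text} {' '.join(suffix)}".rstrip())
        lines.append(f"Sentiment: {label}")
    # The user query itself is left untouched; only demonstrations carry suffixes.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


def greedy_suffix_search(
    demos: Sequence[Demo],
    query: str,
    vocab: Sequence[str],
    score: Callable[[str], float],
    suffix_len: int = 2,
    rounds: int = 2,
) -> Tuple[List[List[str]], float]:
    """Greedy coordinate search over suffix tokens.

    Assumption: every vocabulary token is scored at every position. A
    gradient-based search would instead use gradients to shortlist
    promising substitutions before scoring them.
    """
    suffixes = [[vocab[0]] * suffix_len for _ in demos]
    best = score(build_prompt(demos, suffixes, query))
    for _ in range(rounds):
        for d in range(len(demos)):
            for pos in range(suffix_len):
                for tok in vocab:
                    previous = suffixes[d][pos]
                    suffixes[d][pos] = tok
                    candidate = score(build_prompt(demos, suffixes, query))
                    if candidate > best:
                        best = candidate  # keep the improving substitution
                    else:
                        suffixes[d][pos] = previous  # revert
    return suffixes, best


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in objective: counts a made-up trigger token."""
    return float(prompt.count("zx"))
```

Calling `greedy_suffix_search(demos, query, ["a", "zx", "qq"], toy_score)` with a couple of labeled demonstrations returns the learned suffixes and their score; `build_prompt` then yields the poisoned prompt that would be sent to the model.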