In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks, especially in few-shot settings. Despite being widely applied, in-context learning is vulnerable to malicious attacks. In this work, we raise security concerns regarding this paradigm. Our studies demonstrate that an attacker can manipulate the behavior of large language models by poisoning the demonstration context, without the need for fine-tuning the model. Specifically, we design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning. Our method encompasses two types of attacks: poisoning demonstration examples and poisoning demonstration prompts, which can make models behave in alignment with predefined intentions. ICLAttack does not require additional fine-tuning to implant a backdoor, thus preserving the model's generality. Furthermore, the poisoned examples are correctly labeled, enhancing the natural stealth of our attack method. Extensive experimental results across several language models, ranging in size from 1.3B to 180B parameters, demonstrate the effectiveness of our attack method, exemplified by a high average attack success rate of 95.0% across the three datasets on OPT models.
翻译:上下文学习作为一种连接预训练与微调的范式,已在多项自然语言处理任务中展现出高效能,尤其是在少样本场景下。尽管该范式被广泛应用,但其易受恶意攻击。本研究针对这一范式提出了安全关切。我们的研究表明,攻击者无需微调模型,仅需通过投毒演示上下文即可操控大型语言模型的行为。具体而言,我们设计了一种名为ICLAttack的新型后门攻击方法,专门针对基于上下文学习的大型语言模型。该方法包含两种攻击类型:投毒演示样例与投毒演示提示,能够使模型的行为与预设意图保持一致。ICLAttack无需额外微调即可植入后门,从而保留模型的通用性。此外,投毒样例均被正确标注,增强了攻击方法天然的隐蔽性。在参数规模从1.3B到180B的多个语言模型上的大量实验结果表明,我们攻击方法的有效性得以验证——例如在OPT模型上的三个数据集中,平均攻击成功率高达95.0%。