The increasing demand for customized Large Language Models (LLMs) has led to the development of solutions like GPTs. These solutions facilitate tailored LLM creation via natural language prompts without coding. However, the trustworthiness of third-party custom versions of LLMs remains an essential concern. In this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs (e.g., GPTs). Specifically, these attacks embed the backdoor into the custom version of LLMs by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We stress that our attacks do not require fine-tuning or any modification to the backend LLMs, adhering strictly to GPTs development guidelines. We conduct extensive experiments on 6 prominent LLMs and 5 benchmark text classification datasets. The results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. Additionally, we propose two defense strategies and demonstrate their effectiveness in reducing such attacks. Our findings highlight the vulnerability and the potential risks of LLM customization such as GPTs.
翻译:随着对定制化大语言模型(LLMs)需求的日益增长,诸如GPTs等解决方案应运而生。这些解决方案通过自然语言提示(无需编码)促进了定制化LLM的创建。然而,第三方定制版LLM的可信度仍然是一个至关重要的问题。在本文中,我们首次提出了针对集成了不可信定制化LLM(例如GPTs)的应用程序的指令后门攻击。具体而言,这些攻击通过设计包含后门指令的提示词,将后门嵌入到定制版LLM中,当输入包含预定义触发器时,模型会输出攻击者期望的结果。我们的攻击包括三个层次的攻击:词级、句法级和语义级,它们采用了不同类型的触发器,其隐蔽性依次递增。我们强调,我们的攻击不需要对后端LLM进行微调或任何修改,严格遵循GPTs的开发指南。我们在6个主流LLM和5个基准文本分类数据集上进行了广泛的实验。结果表明,我们的指令后门攻击在实现预期攻击性能的同时,并未损害模型的实用性。此外,我们提出了两种防御策略,并证明了它们在降低此类攻击方面的有效性。我们的研究结果凸显了诸如GPTs等LLM定制化方案存在的脆弱性和潜在风险。