The increasing demand for customized Large Language Models (LLMs) has led to the development of solutions like GPTs. These solutions facilitate tailored LLM creation via natural language prompts without coding. However, the trustworthiness of third-party custom versions of LLMs remains an essential concern. In this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs (e.g., GPTs). Specifically, these attacks embed the backdoor into the custom version of LLMs by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We stress that our attacks do not require fine-tuning or any modification to the backend LLMs, adhering strictly to GPTs development guidelines. We conduct extensive experiments on 4 prominent LLMs and 5 benchmark text classification datasets. The results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. Additionally, we propose an instruction-ignoring defense mechanism and demonstrate its partial effectiveness in mitigating such attacks. Our findings highlight the vulnerability and the potential risks of LLM customization such as GPTs.
翻译:针对定制化大型语言模型(LLMs)日益增长的需求催生了如GPTs等解决方案。这些方案允许用户通过自然语言提示创建定制化LLM,无需编写代码。然而,第三方定制版LLM的可靠性仍是一个关键问题。本文首次提出针对集成不可信定制化LLM(如GPTs)的应用程序的指令后门攻击。具体而言,此类攻击通过设计包含后门指令的提示来将后门嵌入LLM的定制版本,当输入包含预定义触发器时输出攻击者期望的结果。我们的攻击包含三个层级:词级、句法级和语义级,采用不同隐蔽性递增的触发器类型。需要强调的是,我们的攻击无需对后端LLM进行微调或任何修改,且严格遵循GPTs的开发指南。我们在4种主流LLM和5个基准文本分类数据集上进行了广泛实验。结果表明,我们的指令后门攻击在不损害实用性的前提下达到了预期的攻击性能。此外,我们提出了一种指令忽略防御机制,并证明其在缓解此类攻击方面具有部分有效性。我们的发现凸显了LLM定制(如GPTs)的脆弱性与潜在风险。