The increasing demand for customized Large Language Models (LLMs) has led to the development of solutions like GPTs. These solutions facilitate tailored LLM creation via natural language prompts without coding. However, the trustworthiness of third-party custom versions of LLMs remains an essential concern. In this paper, we propose the first instruction backdoor attacks against applications integrated with untrusted customized LLMs (e.g., GPTs). Specifically, these attacks embed the backdoor into the custom version of LLMs by designing prompts with backdoor instructions, outputting the attacker's desired result when inputs contain the pre-defined triggers. Our attack includes 3 levels of attacks: word-level, syntax-level, and semantic-level, which adopt different types of triggers with progressive stealthiness. We stress that our attacks do not require fine-tuning or any modification to the backend LLMs, adhering strictly to GPTs development guidelines. We conduct extensive experiments on 4 prominent LLMs and 5 benchmark text classification datasets. The results show that our instruction backdoor attacks achieve the desired attack performance without compromising utility. Additionally, we propose an instruction-ignoring defense mechanism and demonstrate its partial effectiveness in mitigating such attacks. Our findings highlight the vulnerability and the potential risks of LLM customization such as GPTs.
翻译:随着对定制化大型语言模型(LLM)需求的增长,GPTs等解决方案应运而生。这类方案通过自然语言提示实现无代码的LLM定制,然而第三方定制版本的可信度仍是关键问题。本文首次针对集成不可信定制LLM(如GPTs)的应用提出指令后门攻击。具体而言,该攻击通过设计包含后门指令的提示,将后门嵌入LLM定制版本,当输入包含预设触发器时输出攻击者期望的结果。我们的攻击包含三个层级:词汇级、句法级和语义级,采用渐进隐蔽性的不同触发器类型。需强调的是,本攻击无需微调或修改后端LLM,严格遵循GPTs开发规范。我们在4种主流LLM和5个基准文本分类数据集上开展广泛实验,结果表明指令后门攻击在保持实用性的同时达到了预期攻击效果。此外,我们提出指令忽略防御机制,验证其在缓解此类攻击方面的部分有效性。研究揭示了LLM定制(如GPTs)的脆弱性与潜在风险。