Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to around 0%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, our defended models are still practical with similar utility to the one before our defensive training. Our code is at https://github.com/facebookresearch/SecAlign
翻译:大型语言模型(LLM)在现代软件系统中日益普及,作为用户与互联网之间的接口,协助完成需要高级语言理解的任务。为实现这些任务,LLM常使用外部数据源,如用户文档、网络检索结果、API调用结果等。这为攻击者通过提示注入操纵LLM开辟了新途径。对抗性提示可被注入外部数据源,以覆盖系统的原始指令并转而执行恶意指令。为缓解此漏洞,我们提出一种基于偏好优化技术的新型防御方法SecAlign。该防御首先构建包含提示注入输入、安全输出(响应合法指令)与不安全输出(响应注入指令)的偏好数据集。随后在此数据集上进行偏好优化,以教导LLM优先选择安全输出而非不安全输出。该方法成为首个已知能将各类提示注入成功率降至接近0%的技术,即使面对比训练时所见更为复杂的攻击亦能有效应对。这表明我们的防御对未知及未来可能出现的攻击具有良好的泛化能力。同时,经过防御训练的模型仍保持实用性,其功能效用与训练前模型相近。代码发布于https://github.com/facebookresearch/SecAlign