Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

Large language models (LLMs) have demonstrated impressive performance and have come to dominate the field of natural language processing (NLP) across various tasks. However, due to their strong instruction-following capabilities and inability to distinguish between instructions and data content, LLMs are vulnerable to prompt injection attacks. These attacks manipulate LLMs into deviating from the original input instructions and executing maliciously injected instructions within data content, such as web documents retrieved from search engines. Existing defense methods, including prompt-engineering and fine-tuning approaches, typically instruct models to follow the original input instructions while suppressing their tendencies to execute injected instructions. However, our experiments reveal that suppressing instruction-following tendencies is challenging. Through analyzing failure cases, we observe that although LLMs tend to respond to any recognized instructions, they are aware of which specific instructions they are executing and can correctly reference them within the original prompt. Motivated by these findings, we propose a novel defense method that leverages, rather than suppresses, the instruction-following abilities of LLMs. Our approach prompts LLMs to generate responses that include both answers and their corresponding instruction references. Based on these references, we filter out answers not associated with the original input instructions. Comprehensive experiments demonstrate that our method outperforms prompt-engineering baselines and achieves performance comparable to fine-tuning methods, reducing the attack success rate (ASR) to 0 percent in some scenarios. Moreover, our approach has minimal impact on overall utility.

翻译：大型语言模型（LLMs）展现出卓越性能，已在自然语言处理（NLP）各领域占据主导地位。然而，由于其强大的指令跟随能力与无法区分指令与数据内容的特性，LLMs易遭受提示注入攻击。此类攻击诱导LLMs偏离原始输入指令，转而执行数据内容中的恶意注入指令（如搜索引擎检索的网页文档）。现有防御方法（包括提示工程与微调方法）通常指示模型遵循原始输入指令，同时抑制其执行注入指令的倾向。但我们的实验表明，抑制指令跟随倾向颇具挑战性。通过分析失败案例，我们发现尽管LLMs倾向于响应任何可识别的指令，但它们能感知自身具体执行的指令，并能在原始提示中正确引用这些指令。受此启发，我们提出一种新型防御方法：利用而非抑制LLMs的指令跟随能力。该方法引导LLMs生成包含答案及其对应指令引用的响应，并基于引用过滤掉与原始输入指令无关的答案。综合实验表明，我们的方法优于提示工程基线方法，性能与微调方法相当，在部分场景中将攻击成功率（ASR）降至0%。此外，该方法对整体效用的影响微乎其微。