With the advancement of technology, large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, powering LLM-integrated applications like Microsoft Copilot. However, as LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks arise. These attacks trick LLMs into deviating from the original input instructions and executing the attacker's instructions injected in data content, such as retrieved results. Recent attack methods leverage LLMs' instruction-following abilities and their inabilities to distinguish instructions injected in the data content, and achieve a high attack success rate (ASR). When comparing the attack and defense methods, we interestingly find that they share similar design goals, of inducing the model to ignore unwanted instructions and instead to execute wanted instructions. Therefore, we raise an intuitive question: Could these attack techniques be utilized for defensive purposes? In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. Our comprehensive experiments demonstrate that our defense techniques outperform existing training-free defense approaches, achieving state-of-the-art results.
翻译:随着技术进步,大型语言模型(LLMs)在各种自然语言处理(NLP)任务中取得了显著性能,并驱动了如Microsoft Copilot等LLM集成应用的发展。然而,随着LLMs的持续演进,新的漏洞——尤其是提示注入攻击——随之出现。这类攻击诱使LLMs偏离原始输入指令,转而执行攻击者注入在数据内容(例如检索结果)中的指令。近期的攻击方法利用LLMs的指令遵循能力及其无法区分数据内容中注入指令的缺陷,实现了较高的攻击成功率(ASR)。在对比攻击与防御方法时,我们有趣地发现它们具有相似的设计目标:诱导模型忽略不需要的指令,转而执行期望的指令。因此,我们提出一个直观的问题:这些攻击技术能否被用于防御目的?本文通过反转提示注入方法的意图,基于先前的免训练攻击方法开发了新型防御方法,其核心是重复攻击过程但使用原始输入指令而非注入指令。我们的综合实验表明,所提出的防御技术优于现有的免训练防御方法,取得了最先进的结果。