Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks, where malicious inputs manipulate the model into ignoring its original instructions and executing attacker-designated actions. In this paper, we investigate the underlying mechanisms of these attacks by analyzing the attention patterns within LLMs. We introduce the concept of the distraction effect, in which specific attention heads, termed important heads, shift focus from the original instruction to the injected instruction. Building on this discovery, we propose Attention Tracker, a training-free detection method that tracks attention patterns on the instruction to detect prompt injection attacks without requiring additional LLM inference. Our method generalizes effectively across diverse models, datasets, and attack types, showing an AUROC improvement of up to 10.0% over existing methods, and performs well even on small LLMs. We demonstrate the robustness of our approach through extensive evaluations and provide insights into safeguarding LLM-integrated systems from prompt injection vulnerabilities.
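The core idea of tracking attention on the instruction can be sketched as follows. This is an illustrative simplification, not the paper's exact implementation: the function names (`focus_score`, `detect_injection`), the head-selection input `important_heads`, and the threshold value are all assumptions introduced here for demonstration. The sketch scores how much attention mass the final query token places on the original instruction span, averaged over a set of selected heads, and flags an input as injected when that score drops, reflecting the distraction effect.

```python
# Hypothetical sketch of attention-based injection detection, assuming
# per-head attention maps are available as a NumPy array. Names and the
# threshold are illustrative, not the paper's actual parameters.
import numpy as np

def focus_score(attn, instruction_span, important_heads):
    """attn: shape (layers, heads, query_len, key_len), rows sum to 1.
    Returns the mean attention mass that the last query token places on
    the instruction tokens, averaged over the selected (layer, head) pairs."""
    start, end = instruction_span
    scores = [attn[l, h, -1, start:end].sum() for l, h in important_heads]
    return float(np.mean(scores))

def detect_injection(attn, instruction_span, important_heads, threshold=0.3):
    # Flag as injected when attention to the instruction falls below threshold.
    return focus_score(attn, instruction_span, important_heads) < threshold

# Toy demo: a normalized attention tensor where the last token attends
# mostly to the instruction span (tokens 0..4), so no injection is flagged.
rng = np.random.default_rng(0)
attn = rng.random((2, 4, 10, 10))
attn /= attn.sum(axis=-1, keepdims=True)
attn[:, :, -1, :] = 0.0
attn[:, :, -1, 0:5] = 0.19   # heavy mass on the instruction span (sums to 0.95)
attn[:, :, -1, 5:10] = 0.01  # little mass elsewhere (sums to 0.05)
heads = [(0, 1), (1, 2)]
print(detect_injection(attn, (0, 5), heads))  # → False
```

In practice, the important heads would be identified empirically (e.g., by comparing attention patterns on clean versus injected prompts), and the score would be computed from the attention maps already produced during normal generation, which is what makes the approach free of additional inference cost.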