% Prompt injection attacks insert malicious instructions into an LLM's input to steer the model toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, thereby misclassifying benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish three categories of inputs: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones (where inputs with aligned instructions are largely absent) show that AlignSentinel accurately detects inputs with misaligned instructions and substantially outperforms baselines.
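The three-way decision described above can be sketched minimally. Note this is an illustrative toy, not AlignSentinel itself: the single feature (mean attention mass that data tokens direct back at the intended-task prompt tokens), the `lo`/`hi` thresholds, and the label mapping are all hypothetical assumptions standing in for the paper's learned classifier over attention-map features.

```python
# Toy sketch of the three-class setup: label an input as carrying a
# misaligned instruction, an aligned instruction, or no instruction,
# using one hypothetical feature from a (toy) attention map.
from statistics import mean

def attention_feature(attn_map, n_prompt_tokens):
    """Mean attention each data token pays to the intended-task prompt tokens.

    attn_map: square list-of-lists; row i holds token i's attention weights.
    The first n_prompt_tokens rows/columns belong to the intended-task prompt.
    """
    data_rows = attn_map[n_prompt_tokens:]
    return mean(mean(row[:n_prompt_tokens]) for row in data_rows)

def classify(attn_map, n_prompt_tokens, lo=0.2, hi=0.5):
    """Three-way decision on the single toy score (thresholds are assumed)."""
    score = attention_feature(attn_map, n_prompt_tokens)
    if score >= hi:   # data tokens strongly tied to the intended task
        return "aligned_instruction"
    if score >= lo:   # instruction-like structure, weak tie to the task
        return "misaligned_instruction"
    return "non_instruction"  # no instruction signal at all
```

A real system would replace the hand-set thresholds with a trained three-class model over many such attention-derived features.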