Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. State-of-the-art prompt injection detection methods have the following limitations: (1) their effectiveness degrades significantly as context length increases, and (2) they lack explicit rules that define what constitutes prompt injection, causing detection decisions to be implicit, opaque, and difficult to reason about. In this work, we propose AgentWatcher to address the above two limitations. To address the first limitation, AgentWatcher attributes the LLM's output (e.g., the action of an agent) to a small set of causally influential context segments. By focusing detection on a relatively short text, AgentWatcher can be scalable to long contexts. To address the second limitation, we define a set of rules specifying what does and does not constitute a prompt injection, and use a monitor LLM to reason over these rules based on the attributed text, making the detection decisions more explainable. We conduct a comprehensive evaluation on tool-use agent benchmarks and long-context understanding datasets. The experimental results demonstrate that AgentWatcher can effectively detect prompt injection and maintain utility without attacks. The code is available at https://github.com/wang-yanting/AgentWatcher.
翻译:大型语言模型(LLM)及其应用(如智能代理)极易受到提示注入攻击。当前最先进的提示注入检测方法存在以下局限性:(1)随着上下文长度增加,其检测效能显著下降;(2)缺乏明确定义何为提示注入的显式规则,导致检测决策隐含、不透明且难以推理。为此,我们提出AgentWatcher以解决上述两个问题。针对第一个局限性,AgentWatcher将LLM的输出(如代理的动作)归因于少量具有因果影响力的上下文片段。通过聚焦于相对较短的文本进行检测,AgentWatcher可扩展至长上下文场景。针对第二个局限性,我们定义了一组明确界定提示注入与否的规则,并利用监控LLM基于归因文本对这些规则进行推理,使检测决策更具可解释性。我们在工具调用型代理基准测试和长上下文理解数据集上进行了全面评估。实验结果表明,AgentWatcher能有效检测提示注入,并在无攻击场景下保持可用性。代码开源地址:https://github.com/wang-yanting/AgentWatcher