AgentSentry: Mitigating Indirect Prompt Injection in LLM Agents via Temporal Causal Diagnostics and Context Purification

Large language model (LLM) agents increasingly rely on external tools and retrieval systems to autonomously complete complex tasks. However, this design exposes agents to indirect prompt injection (IPI), where attacker-controlled context embedded in tool outputs or retrieved content silently steers agent actions away from user intent. Unlike prompt-based attacks, IPI unfolds over multi-turn trajectories, making malicious control difficult to disentangle from legitimate task execution. Existing inference-time defenses primarily rely on heuristic detection and conservative blocking of high-risk actions, which can prematurely terminate workflows or broadly suppress tool usage under ambiguous multi-turn scenarios. We propose AgentSentry, a novel inference-time detection and mitigation framework for tool-augmented LLM agents. To the best of our knowledge, AgentSentry is the first inference-time defense to model multi-turn IPI as a temporal causal takeover. It localizes takeover points via controlled counterfactual re-executions at tool-return boundaries and enables safe continuation through causally guided context purification that removes attack-induced deviations while preserving task-relevant evidence. We evaluate AgentSentry on the \textsc{AgentDojo} benchmark across four task suites, three IPI attack families, and multiple black-box LLMs. AgentSentry eliminates successful attacks and maintains strong utility under attack, achieving an average Utility Under Attack (UA) of 74.55 %, improving UA by 20.8 to 33.6 percentage points over the strongest baselines without degrading benign performance.

翻译：大型语言模型（LLM）智能体日益依赖外部工具与检索系统以自主完成复杂任务。然而，这种设计使智能体面临间接提示注入（IPI）攻击的风险，即攻击者通过嵌入工具输出或检索内容中的受控上下文，悄无声息地将智能体行为引导至偏离用户意图的方向。与基于提示的攻击不同，IPI在多轮交互轨迹中逐步展开，使得恶意控制难以从合法的任务执行中分离。现有的推理时防御主要依赖启发式检测和对高风险行为的保守阻断，这在模糊的多轮场景下可能导致工作流过早终止或广泛抑制工具使用。我们提出了AgentSentry，一种面向工具增强型LLM智能体的新型推理时检测与缓解框架。据我们所知，AgentSentry是首个将多轮IPI建模为时序因果劫持的推理时防御方法。它通过在工具返回边界进行受控反事实重执行以定位劫持点，并借助因果引导的上下文净化实现安全延续——该方法在保留任务相关证据的同时消除攻击引发的偏差。我们在\textsc{AgentDojo}基准测试上对AgentSentry进行了评估，涵盖四个任务套件、三类IPI攻击家族及多种黑盒LLM。AgentSentry成功消除了所有攻击，并在受攻击时保持高效用，平均攻击下效用（UA）达到74.55%，较最强基线提升了20.8至33.6个百分点，且不影响正常场景下的性能表现。