LLM agents are highly vulnerable to Indirect Prompt Injection (IPI), where adversaries embed malicious directives in untrusted tool outputs to hijack execution. Most existing defenses treat IPI as an input-level semantic discrimination problem, an approach that often fails to generalize to unseen payloads. We propose a new paradigm, action-level causal attribution, which secures agents by asking why a particular tool call was produced. The central goal is to distinguish tool calls supported by the user's intent from those causally driven by untrusted observations. We instantiate this paradigm with AttriGuard, a runtime defense based on parallel counterfactual tests. For each proposed tool call, AttriGuard verifies its necessity by re-executing the agent under a control-attenuated view of external observations. Technically, AttriGuard combines teacher-forced shadow replay to prevent attribution confounding, hierarchical control attenuation to suppress diverse control channels while preserving task-relevant information, and a fuzzy survival criterion that is robust to LLM stochasticity. Across four LLMs and two agent benchmarks, AttriGuard achieves a 0% attack success rate (ASR) under static attacks with negligible utility loss and moderate overhead. Importantly, it remains resilient under adaptive optimization-based attacks in settings where leading defenses degrade significantly.
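The counterfactual test described above can be caricatured with a toy sketch. Everything here is an illustrative assumption, not the paper's implementation: `attenuate` is a trivial keyword heuristic standing in for hierarchical control attenuation, `toy_agent` is a deterministic stand-in for an LLM agent, and the majority-vote threshold stands in for the fuzzy survival criterion.

```python
def attenuate(observation: str) -> str:
    """Toy 'control attenuation': drop imperative-looking lines from an
    untrusted observation while keeping task-relevant data. (Stand-in for
    the paper's hierarchical control attenuation.)"""
    return "\n".join(
        line for line in observation.splitlines()
        if not line.lower().startswith(("ignore", "please", "now "))
    )


def is_call_attributed_to_user(agent, user_task, observation, proposed_call,
                               n_trials=3, survival_threshold=0.5):
    """Counterfactual attribution test: replay the agent on an attenuated
    view of the observation. If the proposed tool call still 'survives' in
    enough replays (a fuzzy criterion tolerant of LLM stochasticity), it is
    attributed to the user's intent; otherwise it is flagged as driven by
    the untrusted observation."""
    attenuated = attenuate(observation)
    survivals = sum(
        agent(user_task, attenuated) == proposed_call
        for _ in range(n_trials)
    )
    return survivals / n_trials >= survival_threshold


def toy_agent(task, obs):
    """Deterministic toy agent that is 'hijackable': it obeys an injected
    imperative line in the observation if one is present."""
    for line in obs.splitlines():
        if line.lower().startswith("ignore"):
            return "transfer_funds"
    return "summarize_doc"
```

With an injected observation, `toy_agent` proposes `transfer_funds`; under the attenuated view that call no longer survives, so the test flags it as observation-driven, while a benign `summarize_doc` call passes.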