CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks. In this attack scenario, malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift where the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, provides disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on ``poisoned'' reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving utility and latency of AI agents.

翻译：具备工具调用能力的AI智能体易受间接提示注入攻击。在此类攻击场景中，隐藏于不可信内容中的恶意指令会诱使智能体执行未授权操作。现有防御方法虽能降低攻击成功率，但常面临过度防御困境：无论实际威胁是否存在，均采用持续运行的高成本净化机制，导致即使在良性场景下也会损害系统效用并增加延迟。本文通过因果消融视角重新审视间接提示注入：成功的注入表现为支配性转移——用户请求不再为智能体的特权操作提供决定性支持，而特定不可信片段（如检索文档或工具输出）产生不成比例的归因性影响。基于此特征，我们提出CausalArmor选择性防御框架，其具备以下特性：（i）在特权决策点计算基于留一法消融的轻量级归因；（ii）仅当不可信片段支配用户意图时触发定向净化。此外，CausalArmor采用追溯式思维链掩码机制，防止智能体基于“污染”推理轨迹执行操作。理论分析表明，基于归因边际的净化机制能以指数级小的上界条件化约束恶意操作的选择概率。在AgentDojo与DoomArena平台上的实验证明，CausalArmor在保持攻击防御能力与激进方案相当的同时，显著提升可解释性，并有效维护AI智能体的实用性与响应延迟。