Large Language Model (LLM) agents are susceptible to Indirect Prompt Injection (IPI) attacks, in which malicious instructions embedded in retrieved content hijack the agent's execution. Existing defenses typically rely on strict filtering or refusal mechanisms, which suffer from a critical limitation: over-refusal, prematurely terminating valid agentic workflows. We propose ICON, a probing-to-mitigation framework that neutralizes attacks while preserving task continuity. Our key insight is that IPI attacks leave distinct over-focusing signatures in the latent space. We introduce a Latent Space Trace Prober that detects attacks based on high intensity scores. Subsequently, a Mitigating Rectifier performs surgical attention steering that selectively manipulates adversarial query-key dependencies while amplifying task-relevant elements to restore the LLM's functional trajectory. Extensive evaluations across multiple backbones show that ICON achieves a competitive 0.4% ASR, matching commercial-grade detectors, while yielding an over-50% gain in task utility. Furthermore, ICON demonstrates robust out-of-distribution (OOD) generalization and extends effectively to multi-modal agents, establishing a superior balance between security and efficiency.
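The attention-steering idea behind the Mitigating Rectifier can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `steer_attention`, the scaling factors `alpha`/`beta`, and the use of boolean masks over key positions are all illustrative assumptions. The sketch shows the general mechanism of selectively attenuating attention logits at key positions flagged as adversarial while amplifying task-relevant ones, then renormalizing.

```python
import numpy as np

def steer_attention(scores, adv_mask, task_mask, alpha=0.1, beta=1.5):
    """Illustrative attention steering (not ICON's actual algorithm).

    scores    : (num_queries, num_keys) raw attention logits
    adv_mask  : boolean mask over keys flagged as adversarial
    task_mask : boolean mask over keys deemed task-relevant
    alpha < 1 suppresses adversarial keys; beta > 1 boosts task keys.
    Adding log(alpha)/log(beta) to logits multiplies the post-softmax
    weights by alpha/beta before renormalization.
    """
    steered = scores.astype(float).copy()
    steered[:, adv_mask] += np.log(alpha)   # attenuate adversarial query-key dependencies
    steered[:, task_mask] += np.log(beta)   # amplify task-relevant dependencies
    # softmax over keys (each row is one query's attention distribution)
    e = np.exp(steered - steered.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: 2 queries, 4 keys; key 2 flagged adversarial, key 0 task-relevant.
scores = np.zeros((2, 4))
adv = np.array([False, False, True, False])
task = np.array([True, False, False, False])
attn = steer_attention(scores, adv, task)
```

With uniform logits, the steered distribution shifts mass from the flagged adversarial key to the task-relevant one while remaining a valid probability distribution over keys.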