Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new security risks such as indirect prompt injection attacks through untrusted external sources. Existing defenses mainly focus on blocking malicious content at inference time, and current red-teaming methods primarily optimize attack success. As a result, developers have limited visibility into how latent prompt injections emerge and propagate through agents. We propose PI-Hunter, an automated agentic auditing framework for proactive vulnerability exposure in LLM agents. PI-Hunter constructs realistic source-aware test cases and iteratively evolves them through feedback-driven exploration to induce agents to retrieve and reveal latent malicious instructions embedded within external environments. Extensive experiments across multiple benchmarks, agent architectures, attacks, and defenses demonstrate that PI-Hunter substantially improves vulnerability exposure and attack-surface coverage over strong automated red-teaming baselines, while remaining effective under existing prompt injection defenses.