Large language models (LLMs) increasingly rely on retrieving information from external corpora. This creates a new attack surface: indirect prompt injection (IPI), where hidden instructions are planted in the corpora and hijack model behavior once retrieved. Previous studies have highlighted this risk but often avoid the hardest step: ensuring that malicious content is actually retrieved. In practice, unoptimized IPI is rarely retrieved under natural queries, which leaves its real-world impact unclear. We address this challenge by decomposing the malicious content into a trigger fragment that guarantees retrieval and an attack fragment that encodes arbitrary attack objectives. Based on this idea, we design an efficient and effective black-box attack algorithm that constructs a compact trigger fragment to guarantee retrieval for any attack fragment. Our attack requires only API access to embedding models, is cost-efficient (as little as $0.21 per target user query on OpenAI's embedding models), and achieves near-100% retrieval across 11 benchmarks and 8 embedding models (including both open-source models and proprietary services). Based on this attack, we present the first end-to-end IPI exploits under natural queries and realistic external corpora, spanning both RAG and agentic systems with diverse attack objectives. These results establish IPI as a practical and severe threat: when a user issued a natural query to summarize emails on frequently asked topics, a single poisoned email was sufficient to coerce GPT-4o into exfiltrating SSH keys with over 80% success in a multi-agent workflow. We further evaluate several defenses and find that they are insufficient to prevent the retrieval of malicious text, highlighting retrieval as a critical open vulnerability.
翻译:大型语言模型(LLM)日益依赖从外部语料库中检索信息。这产生了一个新的攻击面:间接提示注入(IPI),即恶意指令被植入语料库中,一旦被检索到就会劫持模型行为。先前的研究已指出这一风险,但往往回避了最困难的步骤:确保恶意内容实际被检索到。实践中,未经优化的IPI在自然查询下很少被检索,这使得其实际影响尚不明确。我们通过将恶意内容分解为两个部分来解决这一挑战:保证检索的触发片段和编码任意攻击目标的攻击片段。基于这一思路,我们设计了一种高效的黑盒攻击算法,该算法构建紧凑的触发片段以确保任意攻击片段均能被检索。我们的攻击仅需嵌入模型的API访问权限,具有成本效益(在OpenAI的嵌入模型上,每个目标用户查询成本可低至0.21美元),并在11个基准测试和8个嵌入模型(包括开源模型和专有服务)上实现了接近100%的检索成功率。基于此攻击,我们首次在自然查询和现实外部语料库条件下展示了端到端的IPI攻击实例,涵盖RAG系统和智能体系统,并实现了多样化的攻击目标。这些结果表明IPI是一种切实存在且严重的威胁:当用户发出自然查询以总结常见主题的电子邮件时,仅需一封被投毒的邮件就足以在多智能体工作流中胁迫GPT-4o泄露SSH密钥,成功率超过80%。我们进一步评估了多种防御措施,发现它们均不足以阻止恶意文本的检索,这凸显了检索环节仍是一个关键且未解决的漏洞。