We are currently witnessing dramatic advances in the capabilities of Large Language Models (LLMs). They are already being adopted in practice and integrated into many systems, including integrated development environments (IDEs) and search engines. The functionalities of current LLMs can be modulated via natural language prompts, while their exact internal functionality remains implicit and unassessable. This property, which makes them adaptable to even unseen tasks, might also make them susceptible to targeted adversarial prompting. Recently, several ways to misalign LLMs using Prompt Injection (PI) attacks have been introduced. In such attacks, an adversary can prompt the LLM to produce malicious content or override the original instructions and the employed filtering schemes. Recent work showed that these attacks are hard to mitigate, as state-of-the-art LLMs are instruction-following. So far, these attacks assumed that the adversary is directly prompting the LLM. In this work, we show that augmenting LLMs with retrieval and API calling capabilities (so-called Application-Integrated LLMs) induces a whole new set of attack vectors. These LLMs might process poisoned content retrieved from the Web that contains malicious prompts pre-injected and selected by adversaries. We demonstrate that an attacker can indirectly perform such PI attacks. Based on this key insight, we systematically analyze the resulting threat landscape of Application-Integrated LLMs and discuss a variety of new attack vectors. To demonstrate the practical viability of our attacks, we implemented specific demonstrations of the proposed attacks within synthetic applications. In summary, our work calls for an urgent evaluation of current mitigation techniques and an investigation of whether new techniques are needed to defend LLMs against these threats.
翻译:我们正目睹大语言模型(LLMs)能力的巨大飞跃。这些模型已在实践中被采用,并集成到包括集成开发环境(IDE)和搜索引擎在内的诸多系统中。当前LLMs的功能可通过自然语言提示进行调节,而其内部精确功能仍保持隐式且不可评估。这种特性使其能够适应甚至未曾见过的任务,但也可能使其易受针对性的对抗性提示攻击。近期,已有多种利用提示注入(PI)攻击来使LLMs失配的方法被提出。在这类攻击中,攻击者可以提示LLM生成恶意内容,或覆盖原始指令及所采用的过滤方案。最新研究表明,这些攻击难以缓解,因为最先进的LLMs遵循指令。迄今为止,这些攻击假设攻击者直接对LLM进行提示。在本工作中,我们展示为LLMs增强检索和API调用能力(即所谓的“应用集成式LLM”)会引发一系列全新的攻击向量。这些LLM可能会处理从网络检索到的恶意内容,其中包含攻击者预先注入并选择的恶意提示。我们证明,攻击者可以间接实施此类PI攻击。基于这一关键洞察,我们系统分析了应用集成式LLMs所导致的威胁格局,并讨论多种新型攻击向量。为证明我们攻击的实际可行性,我们在合成应用中实现了所提议攻击的具体示例。综上所述,我们的工作呼吁迫切评估现有缓解技术,并探究是否需要新技术来防御LLMs面对这些威胁。