A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio

From arXiv: 27 pages, 14 figures, 11 tables; submitted to Elsevier Computer Science Review

The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) have been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing, for instance, data leaks, unauthorized actions, or compromised outputs. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address this need, this work presents the first systematic literature review on prompt injection mitigation strategies, covering 88 studies. Building upon NIST's report on adversarial machine learning, this work contributes to the field in several ways. First, it identifies studies beyond those documented in NIST's report and in other academic reviews and surveys. Second, it proposes an extension to the NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST's established terminology and taxonomy as a foundation, it promotes consistency and enables future researchers to build upon the standardized taxonomy proposed in this work. Finally, it provides a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.
