零样本嵌入漂移检测：针对大语言模型中提示注入攻击的轻量级防御 (Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs)

Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of <3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.

翻译：提示注入攻击已成为大语言模型应用日益严重的漏洞，攻击者通过电子邮件或用户生成内容等间接输入渠道植入对抗性提示，以规避对齐安全机制并诱发有害或非预期输出。尽管对齐技术不断进步，即使是最先进的大语言模型仍普遍易受对抗性提示攻击，这凸显出亟需超越低效、模型特定修补方案的鲁棒、高效且可泛化的检测机制。本研究提出零样本嵌入漂移检测框架，该轻量级、低工程开销的框架通过量化良性输入与可疑输入在嵌入空间中的语义偏移，实现对直接与间接提示注入攻击的识别。ZEDD无需访问模型内部参数、预先掌握攻击类型知识或进行任务特定重训练，即可在不同大语言模型架构中实现高效的零样本部署。我们的方法采用对抗-干净提示对，通过余弦相似度度量嵌入漂移，以捕捉现实世界注入攻击中固有的细微对抗性操纵。为确保评估的鲁棒性，我们整合并重新标注了涵盖五大注入类别的综合LLMail-Inject数据集，所有数据均源自公开可用资源。大量实验表明，嵌入漂移是鲁棒且可迁移的检测信号，在检测精度与运行效率上均优于传统方法。该方法在Llama 3、Qwen 2和Mistral等模型架构上实现超过93%的提示注入分类准确率，且误报率低于3%，为现有大语言模型流水线提供了可集成的轻量级可扩展防御层，填补了大语言模型系统抵御自适应对抗威胁的关键安全空白。