Large Language Models (LLMs) are seeing significant adoption in every type of organization due to their exceptional generative capabilities. However, LLMs are found to be vulnerable to various adversarial attacks, particularly prompt injection attacks, which trick them into producing harmful or inappropriate content. Adversaries execute such attacks by crafting malicious prompts to deceive the LLMs. In this paper, we propose a novel approach based on embedding-based Machine Learning (ML) classifiers to protect LLM-based applications against this severe threat. We leverage three commonly used embedding models to generate embeddings of malicious and benign prompts and utilize ML classifiers to predict whether an input prompt is malicious. Out of several traditional ML methods, we achieve the best performance with classifiers built using Random Forest and XGBoost. Our classifiers outperform state-of-the-art prompt injection classifiers available in open-source implementations, which use encoder-only neural networks.
翻译:大型语言模型(LLM)因其卓越的生成能力正被各类组织广泛采用。然而,研究发现LLM易受多种对抗攻击,尤其是提示注入攻击,此类攻击会诱使模型生成有害或不恰当内容。攻击者通过精心构造恶意提示来欺骗LLM。本文提出一种基于嵌入机器学习(ML)分类器的新方法,用于保护基于LLM的应用免受此类严重威胁。我们利用三种常用嵌入模型生成恶意提示与良性提示的嵌入向量,并采用ML分类器预测输入提示是否具有恶意性。在多种传统ML方法中,基于随机森林与XGBoost构建的分类器取得了最佳性能。我们的分类器在开源实现中超越了当前最先进的提示注入检测器(该类检测器采用仅编码器神经网络架构)。