Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks. However, LLMs applications are highly vulnerable to prompt injection attacks, which poses a critical problem. These attacks target LLMs applications through using carefully designed input prompts to divert the model from adhering to original instruction, thereby it could execute unintended actions. These manipulations pose serious security threats which potentially results in data leaks, biased outputs, or harmful responses. This project explores the security vulnerabilities in relation to prompt injection attacks. To detect whether a prompt is vulnerable or not, we follows two approaches: 1) a pre-trained LLM, and 2) a fine-tuned LLM. Then, we conduct a thorough analysis and comparison of the classification performance. Firstly, we use pre-trained XLM-RoBERTa model to detect prompt injections using test dataset without any fine-tuning and evaluate it by zero-shot classification. Then, this proposed work will apply supervised fine-tuning to this pre-trained LLM using a task-specific labeled dataset from deepset in huggingface, and this fine-tuned model achieves impressive results with 99.13\% accuracy, 100\% precision, 98.33\% recall and 99.15\% F1-score thorough rigorous experimentation and evaluation. We observe that our approach is highly efficient in detecting prompt injection attacks.
翻译:大型语言模型(LLMs)因其在处理广泛语言任务方面的显著能力提升,正成为一种流行工具。然而,LLMs应用极易受到提示注入攻击,这构成了一个关键问题。此类攻击通过精心设计的输入提示来针对LLMs应用,使模型偏离原始指令,从而可能执行非预期的操作。这些操纵行为会带来严重的安全威胁,可能导致数据泄露、偏见输出或有害响应。本项目探讨了与提示注入攻击相关的安全漏洞。为检测提示是否易受攻击,我们采用两种方法:1)使用预训练的LLM,2)使用微调的LLM。随后,我们对分类性能进行了全面分析与比较。首先,我们使用预训练的XLM-RoBERTa模型,在未经任何微调的情况下,通过零样本分类对测试数据集进行提示注入检测并评估其性能。接着,本研究将利用来自HuggingFace平台deepset的任务特定标注数据集,对该预训练LLM进行监督微调。经过严格的实验与评估,该微调模型取得了优异的结果:准确率达99.13%,精确率100%,召回率98.33%,F1分数99.15%。我们观察到,该方法在检测提示注入攻击方面具有高效性。