The advent of Large Language Models LLMs marks a milestone in Artificial Intelligence, altering how machines comprehend and generate human language. However, LLMs are vulnerable to malicious prompt injection attacks, where crafted inputs manipulate the models behavior in unintended ways, compromising system integrity and causing incorrect outcomes. Conventional detection methods rely on static, rule-based approaches, which often fail against sophisticated threats like abnormal token sequences and alias substitutions, leading to limited adaptability and higher rates of false positives and false negatives.This paper proposes a novel NLP based approach for prompt injection detection, emphasizing accuracy and optimization through a layered input screening process. In this framework, prompts are filtered through three distinct layers rule-based, ML classifier, and companion LLM before reaching the target model, thereby minimizing the risk of malicious interaction.Tests show the ML classifier achieves the highest accuracy among individual layers, yet the multi-layer framework enhances overall detection accuracy by reducing false negatives. Although this increases false positives, it minimizes the risk of overlooking genuine injected prompts, thus prioritizing security.This multi-layered detection approach highlights LLM vulnerabilities and provides a comprehensive framework for future research, promoting secure interactions between humans and AI systems.
翻译:大型语言模型(LLMs)的出现标志着人工智能领域的一个重要里程碑,改变了机器理解和生成人类语言的方式。然而,LLMs 容易受到恶意提示注入攻击,即精心设计的输入会以非预期的方式操控模型行为,损害系统完整性并导致错误结果。传统的检测方法依赖于静态的、基于规则的方法,这些方法在面对异常标记序列和别名替换等复杂威胁时往往失效,导致适应性有限以及较高的误报率和漏报率。本文提出了一种基于自然语言处理(NLP)的提示注入检测新方法,强调通过分层输入筛查过程实现高精度与优化。在该框架中,提示在到达目标模型之前需经过三个不同的过滤层——基于规则的层、机器学习分类器层以及辅助LLM层,从而最大限度地降低恶意交互的风险。测试表明,机器学习分类器在各独立层中实现了最高的准确率,而多层框架通过减少漏报率提升了整体检测精度。尽管这会增加误报率,但它最大限度地降低了忽略真实注入提示的风险,从而优先保障了安全性。这种多层检测方法突显了LLM的脆弱性,并为未来研究提供了一个全面的框架,以促进人类与AI系统之间的安全交互。