Large language models (LLMs) such as ChatGPT and Gemini have significantly advanced natural language processing, enabling applications such as chatbots and automated content generation. However, these models can be exploited by malicious actors who craft toxic prompts to elicit harmful or unethical responses. Such actors often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both black-box and white-box, struggle with the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight grey-box method for efficiently detecting toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on multiple versions of the Llama models, Gemma-2, and several datasets shows that ToxicDetector achieves a high accuracy of 96.39\% and a low false positive rate of 2.00\%, outperforming state-of-the-art methods. Moreover, its processing time of 0.0780 seconds per prompt makes it well suited for real-time applications. Combining high accuracy, efficiency, and scalability, ToxicDetector is a practical method for toxic prompt detection in LLMs.
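The classification stage described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the randomly generated vectors stand in for the embedding-derived feature vectors, and the hidden-layer size, training settings, and feature dimension are all hypothetical choices.

```python
# Minimal sketch of the final classification step: a Multi-Layer Perceptron
# over feature vectors. Synthetic Gaussian clusters stand in for the
# embedding-derived features of benign vs. toxic prompts (an assumption for
# illustration only); layer sizes and hyperparameters are likewise assumed.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 64  # assumed feature-vector dimension

# Synthetic stand-ins: two separated clusters play the role of
# benign (label 0) and toxic (label 1) prompt features.
benign = rng.normal(0.0, 1.0, size=(200, dim))
toxic = rng.normal(2.0, 1.0, size=(200, dim))
X = np.vstack([benign, toxic])
y = np.array([0] * 200 + [1] * 200)

# A small MLP classifier, as in the method description.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=0)
clf.fit(X, y)

# Classify a new feature vector drawn near the "toxic" cluster.
new_features = rng.normal(2.0, 1.0, size=(1, dim))
label = clf.predict(new_features)[0]
```

At inference time, only a forward pass through the MLP is needed per prompt (after feature extraction), which is what makes this style of detector cheap enough for real-time use.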