Large Language Models (LLMs) such as GPT-4, LLaMA, and Qwen have demonstrated remarkable success across a wide range of applications. However, these models remain inherently vulnerable to prompt injection attacks, which can bypass existing safety mechanisms, highlighting the urgent need for more robust attack detection methods and comprehensive evaluation benchmarks. To address these challenges, we introduce GenTel-Safe, a unified framework that includes a novel prompt injection attack detection method, GenTel-Shield, along with a comprehensive evaluation benchmark, GenTel-Bench, which comprises 84,812 prompt injection attacks spanning 3 major categories and 28 security scenarios. To demonstrate the effectiveness of GenTel-Shield, we evaluate it alongside vanilla safety guardrails on the GenTel-Bench dataset. Empirically, GenTel-Shield achieves state-of-the-art attack detection success rates, and the evaluation reveals critical weaknesses of existing safeguarding techniques against harmful prompts. For reproducibility, we have made the code and benchmarking dataset available on the project page at https://gentellab.github.io/gentel-safe.github.io/.