Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense: they falsely flag benign inputs as malicious due to trigger-word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense, with accuracy dropping to near random-guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model trained with a new strategy, Mitigating Over-defense for Free (MOF), which significantly reduces trigger-word bias. InjecGuard achieves state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, and offers a robust, open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/SaFoLab-WISC/InjecGuard.