Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense: they falsely flag benign inputs as malicious due to trigger-word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense, with accuracy dropping to near random-guessing levels (60%). To mitigate this, we propose InjecGuard, a novel prompt guard model trained with a new strategy, Mitigating Over-defense for Free (MOF), which significantly reduces trigger-word bias. InjecGuard achieves state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.8%, and offers a robust, open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/SaFoLab-WISC/InjecGuard.