JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous applications. However, current LLMs are vulnerable to prompt-based attacks: jailbreaking attacks enable LLMs to generate harmful content, while hijacking attacks manipulate the model into performing unintended tasks. Both underscore the need for detection methods. Unfortunately, existing detection approaches are usually tailored to specific attacks and therefore generalize poorly across attack types and modalities. To address this, we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality. Specifically, JailGuard mutates an untrusted input to generate variants and uses the discrepancy among the model's responses to those variants to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. To evaluate the effectiveness of JailGuard, we build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types. The evaluation suggests that JailGuard achieves the best detection accuracy of 86.14% on text inputs and 82.90% on image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%, respectively.
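
The mutate-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the mutators or the discrepancy measure, so the character-insertion mutator (`random_insert_mutator`), the word-set Jaccard distance, the threshold of 0.5, and the `model` callable are all hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random
from itertools import combinations
from typing import Callable


def random_insert_mutator(prompt: str, rng: random.Random, rate: float = 0.05) -> str:
    """Hypothetical text mutator: insert a placeholder token after some characters."""
    out = []
    for ch in prompt:
        out.append(ch)
        if rng.random() < rate:
            out.append("[x]")
    return "".join(out)


def jaccard_distance(a: str, b: str) -> float:
    """Word-set distance in [0, 1]; an assumed stand-in for a response-discrepancy measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def detect_attack(
    prompt: str,
    model: Callable[[str], str],  # assumption: any function mapping a prompt to a response
    n_variants: int = 8,
    threshold: float = 0.5,       # assumption: illustrative value, not taken from the paper
    seed: int = 0,
) -> bool:
    """Flag the prompt as an attack if responses to its variants disagree strongly."""
    rng = random.Random(seed)
    # Step 1: mutate the untrusted input into several variants.
    variants = [random_insert_mutator(prompt, rng) for _ in range(n_variants)]
    # Step 2: query the model once per variant.
    responses = [model(v) for v in variants]
    # Step 3: take the largest pairwise discrepancy among the responses.
    max_div = max(
        (jaccard_distance(a, b) for a, b in combinations(responses, 2)),
        default=0.0,
    )
    # Step 4: a large discrepancy suggests a fragile, attack-like input.
    return max_div >= threshold
```

The intuition under these assumptions is that a benign prompt tends to yield similar responses for all variants, whereas an attack prompt that stops working under small perturbations yields a mix of compliant and refusing responses, which pushes the pairwise discrepancy above the threshold. A real system would likely replace the word-set distance with a semantic measure and choose the threshold empirically.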

