Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) play a critical role in numerous applications. However, current LLMs are vulnerable to prompt-based attacks: jailbreaking attacks induce LLMs to generate harmful content, while hijacking attacks manipulate the model into performing unintended tasks, underscoring the need for detection methods. Unfortunately, existing detection approaches are usually tailored to specific attacks, so they generalize poorly when detecting varied attacks across different modalities. To address this, we propose JailGuard, a universal detection framework for jailbreaking and hijacking attacks across LLMs and MLLMs. JailGuard operates on the principle that attack inputs are inherently less robust than benign inputs, regardless of attack method or modality. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy among the model's responses to those variants to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. To evaluate the effectiveness of JailGuard, we build the first comprehensive multi-modal attack dataset, containing 11,000 data items across 15 known attack types. The evaluation suggests that JailGuard achieves the best detection accuracy of 86.14% and 82.90% on text and image inputs, respectively, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.
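The mutate-and-compare idea described above can be illustrated with a minimal sketch. It is not JailGuard's implementation: the abstract does not specify the mutators, the discrepancy measure, or the decision threshold, so the character-deletion mutator `random_delete`, the word-level Jensen-Shannon divergence `js_divergence`, the threshold of 0.5, and the `model` callable (any function mapping a prompt string to a response string) are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from itertools import combinations
from math import log2
from typing import Callable


def random_delete(text: str, rate: float, rng: random.Random) -> str:
    """Hypothetical text mutator: drop each character with probability `rate`."""
    return "".join(ch for ch in text if rng.random() >= rate)


def js_divergence(a: str, b: str) -> float:
    """Jensen-Shannon divergence (base 2, range [0, 1]) between word distributions.

    Assumed discrepancy measure; the abstract does not name one.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        # Two empty responses agree; an empty and a non-empty one fully disagree.
        return 0.0 if na == nb else 1.0
    div = 0.0
    for word in set(ca) | set(cb):
        p, q = ca[word] / na, cb[word] / nb
        m = (p + q) / 2
        if p > 0:
            div += 0.5 * p * log2(p / m)
        if q > 0:
            div += 0.5 * q * log2(q / m)
    return div


def max_discrepancy(
    prompt: str,
    model: Callable[[str], str],
    n_variants: int = 8,
    rate: float = 0.05,
    seed: int = 0,
) -> float:
    """Mutate the prompt, query the model on each variant, and return the
    largest pairwise divergence among the responses."""
    rng = random.Random(seed)
    responses = [model(random_delete(prompt, rate, rng)) for _ in range(n_variants)]
    return max(
        (js_divergence(x, y) for x, y in combinations(responses, 2)),
        default=0.0,
    )


def is_attack(
    prompt: str,
    model: Callable[[str], str],
    threshold: float = 0.5,  # assumed value, not taken from the paper
    **kwargs,
) -> bool:
    """Flag the prompt when its variants' responses disagree strongly."""
    return max_discrepancy(prompt, model, **kwargs) >= threshold
```

Under this sketch, a prompt whose variants all produce similar responses yields a discrepancy near 0 and is treated as benign, while a prompt whose variants flip the model between, say, a refusal and a compliant answer yields a discrepancy near 1 and is flagged. An image mutator could be substituted for `random_delete` in the same loop, with the rest of the pipeline unchanged.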