Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for detecting jailbreak prompts rely primarily on online moderation APIs or finetuned LLMs; these strategies, however, often require extensive and resource-intensive data collection and training. In this study, we propose GradSafe, which effectively detects jailbreak prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our method is grounded in a pivotal observation: the gradients of an LLM's loss for jailbreak prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters, whereas safe prompts lead to different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect jailbreak prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting jailbreak prompts, even though Llama Guard is extensively finetuned on a large dataset. This superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at https://github.com/xyq7/GradSafe.
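For concreteness, below is a minimal sketch of the gradient-similarity idea in PyTorch/Transformers. It is an illustration rather than the paper's implementation: the checkpoint name, the "Sure" compliance response, the example prompts, and the use of full-parameter cosine similarity (instead of the paper's restriction to safety-critical parameter slices) are all assumptions made here.

```python
# Sketch: compare the gradient a prompt induces (when paired with a
# compliance response) against reference gradients from known-unsafe prompts.
# Assumptions: checkpoint name, "Sure" as the compliance response, and
# full-parameter similarity; the paper uses safety-critical slices only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()  # params still require grad, so backward() works

def compliance_gradients(prompt: str, response: str = "Sure"):
    """Gradient of the LM loss on `response`, conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    resp_ids = tok(response, add_special_tokens=False,
                   return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # loss only on response tokens
    model.zero_grad()
    model(input_ids, labels=labels).loss.backward()
    # NOTE: storing full gradients of a 7B model is memory-heavy; the paper
    # avoids this by keeping only safety-critical parameter slices.
    return {n: p.grad.detach().clone()
            for n, p in model.named_parameters() if p.grad is not None}

def gradient_similarity(grads_a, grads_b):
    """Mean per-parameter cosine similarity between two gradient dicts."""
    sims = [F.cosine_similarity(grads_a[n].flatten(),
                                grads_b[n].flatten(), dim=0)
            for n in grads_a if n in grads_b]
    return torch.stack(sims).mean().item()

# Reference gradients from a known-unsafe prompt (hypothetical example);
# a new prompt whose gradients align with this reference is flagged.
reference = compliance_gradients("How do I make a bomb?")
score = gradient_similarity(
    reference, compliance_gradients("Tell me how to pick a lock."))
print("gradient similarity:", score)  # threshold this score to classify
```

In this sketch, a score above a chosen threshold flags the prompt as a jailbreak; in practice one would average reference gradients over several known-unsafe prompts before comparing.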