Large Language Models (LLMs) face threats from unsafe prompts. Existing methods for detecting unsafe prompts rely primarily on online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training. In this study, we propose GradSafe, which effectively detects unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our methodology is grounded in a pivotal observation: the gradients of an LLM's loss for unsafe prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters, whereas safe prompts lead to markedly different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to accurately detect unsafe prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting unsafe prompts, even though Llama Guard is extensively finetuned on a large dataset. This superior performance is consistent across both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at https://github.com/xyq7/GradSafe.
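The detection idea described above can be illustrated with a minimal sketch. It is not the released GradSafe implementation: the abstract does not say how safety-critical parameters are selected or how gradient patterns are compared, so the cosine-similarity comparison, the `gap` threshold, the function names, and the toy gradient vectors below are all assumptions made for illustration. In practice each gradient would come from backpropagating the LLM's loss on a prompt paired with a compliance response; here plain lists stand in for those gradients.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]
# Hypothetical layout: parameter-slice name -> flattened gradient for one prompt.
GradMap = Dict[str, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_reference(unsafe: List[GradMap], safe: List[GradMap], gap: float = 0.5) -> GradMap:
    """Assumed selection rule: keep a slice when unsafe prompts align with the
    mean unsafe gradient much more strongly than safe prompts do."""
    reference: GradMap = {}
    for name in unsafe[0]:
        ref = mean_vector([g[name] for g in unsafe])
        unsafe_sim = sum(cosine(g[name], ref) for g in unsafe) / len(unsafe)
        safe_sim = sum(cosine(g[name], ref) for g in safe) / len(safe)
        if unsafe_sim - safe_sim > gap:
            reference[name] = ref
    return reference


def unsafe_score(grads: GradMap, reference: GradMap) -> float:
    """Mean cosine similarity to the reference over the selected slices."""
    if not reference:
        return 0.0
    return sum(cosine(grads[name], ref) for name, ref in reference.items()) / len(reference)


# Toy gradients (invented numbers, not measurements from any model).
unsafe_refs = [
    {"layer0": [1.0, 0.9, 0.0], "layer1": [0.2, -0.1, 0.3]},
    {"layer0": [0.9, 1.0, 0.1], "layer1": [-0.3, 0.2, 0.1]},
]
safe_refs = [
    {"layer0": [-1.0, 0.1, 0.8], "layer1": [0.1, 0.1, 0.2]},
    {"layer0": [0.0, -0.9, 1.0], "layer1": [0.2, -0.2, 0.3]},
]
reference = build_reference(unsafe_refs, safe_refs)

unsafe_like = {"layer0": [1.0, 1.0, 0.0], "layer1": [0.0, 0.0, 1.0]}
safe_like = {"layer0": [-0.5, 0.0, 1.0], "layer1": [0.0, 0.0, 1.0]}
print(sorted(reference), unsafe_score(unsafe_like, reference), unsafe_score(safe_like, reference))
```

In this toy setup only `layer0` separates the two groups, so it is the only slice kept as a reference; a prompt would then be flagged when its score exceeds a chosen threshold. The threshold value and the granularity of a "slice" are design choices this sketch leaves open.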