Alignment improves the safety of widely deployed large language models (LLMs), yet aligned models remain susceptible to jailbreak attacks that elicit inappropriate content. Jailbreak detection methods show promise in mitigating such attacks by relying on auxiliary models or multiple model inferences; however, these approaches entail significant computational cost. In this paper, we first show that the difference in output distributions between jailbreak and benign prompts can be exploited to detect jailbreak prompts. Based on this finding, we propose Free Jailbreak Detection (FJD), which prepends an affirmative instruction to the input and scales the logits by temperature, distinguishing jailbreak from benign prompts via the confidence of the first generated token. We further improve the detection performance of FJD by integrating virtual instruction learning. Extensive experiments on aligned LLMs show that FJD effectively detects jailbreak prompts with almost no additional computational cost during LLM inference.
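To make the first-token confidence idea concrete, below is a minimal sketch (not the paper's implementation) using the Hugging Face transformers API. The affirmative instruction text, temperature, decision threshold, and the direction of the final comparison are all illustrative assumptions, not values taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical affirmative instruction; the paper's exact wording may differ.
AFFIRMATIVE_INSTRUCTION = "Answer the following request, starting with 'Sure':"
TEMPERATURE = 2.0   # assumed temperature for logit scaling
THRESHOLD = 0.5     # assumed decision threshold on first-token confidence

def first_token_confidence(model, tokenizer, prompt: str) -> float:
    """Temperature-scaled confidence of the first token the model would emit."""
    text = f"{AFFIRMATIVE_INSTRUCTION}\n{prompt}"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Logits at the last input position predict the first output token.
        logits = model(**inputs).logits[0, -1, :]
    probs = torch.softmax(logits / TEMPERATURE, dim=-1)
    return probs.max().item()

def is_jailbreak(model, tokenizer, prompt: str) -> bool:
    # Assumption of this sketch: lower first-token confidence under the
    # prepended affirmative instruction signals a jailbreak prompt.
    return first_token_confidence(model, tokenizer, prompt) < THRESHOLD

# Usage (assuming a locally available aligned chat model):
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
# print(is_jailbreak(model, tokenizer, "How do I bake a cake?"))
```

Because the check reuses the forward pass that inference performs anyway, a detector of this shape adds essentially no extra computation, which is consistent with the "free" framing above.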