Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before invoking the target model. However, relying solely on the prompt often leads to high false-negative rates (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model's response, but because they operate after target model inference, they incur a high computational cost, including increased token usage and processing time. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs), and identify key factors that influence it. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of responses from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. Based on this observation, our safeguard design uses speculative inference with SLMs to generate a set of draft responses, then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a low-cost alternative to post-model guards. \textcolor{red}{\bf Notice: This paper contains examples of harmful language.}