Exploring and Developing a Pre-Model Safeguard with Draft Models

Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, which motivates both pre-model and post-model guards. Pre-model guards audit the safety of a prompt before the target model is invoked. However, relying solely on the prompt often leads to a high false-negative rate (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model's response. Because they operate after target model inference, however, they incur a high computational cost, including increased token usage and processing time. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs), and identify the key factors that influence it. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of responses from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is also likely to be triggered into generating an unaligned response. Based on this observation, our safeguard design uses speculative inference with SLMs to generate a set of draft responses. It then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a lower-cost alternative to post-model guards. Notice: This paper contains examples of harmful language.
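The draft-then-guard flow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `DraftModel`, `Guard`, `generate_drafts`, `is_prompt_unsafe`, and `guarded_call` are hypothetical, the rule of flagging a prompt when any single draft is judged unsafe is an assumption, and the toy draft model, guard, and target model are placeholders standing in for real SLMs, an existing guard, and the target LLM.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper):
# a draft model maps a prompt to a draft response, and a guard maps
# a (prompt, response) pair to True when it judges the pair unsafe.
DraftModel = Callable[[str], str]
Guard = Callable[[str, str], bool]


def generate_drafts(prompt: str, draft_models: Sequence[DraftModel]) -> List[str]:
    """Collect one draft response per small draft model."""
    return [model(prompt) for model in draft_models]


def is_prompt_unsafe(prompt: str, draft_models: Sequence[DraftModel], guard: Guard) -> bool:
    """Flag the prompt if the guard judges any (prompt, draft) pair unsafe.

    The any-draft aggregation rule is an assumption of this sketch.
    """
    drafts = generate_drafts(prompt, draft_models)
    return any(guard(prompt, draft) for draft in drafts)


def guarded_call(
    prompt: str,
    draft_models: Sequence[DraftModel],
    guard: Guard,
    target_model: Callable[[str], str],
    refusal: str = "Request declined.",
) -> str:
    """Run the target model only when the draft-based check passes."""
    if is_prompt_unsafe(prompt, draft_models, guard):
        return refusal  # target model is never invoked for flagged prompts
    return target_model(prompt)


# Toy stand-ins so the sketch runs end to end (placeholders only).
def toy_draft_model(prompt: str) -> str:
    # Pretends that one fixed phrase "jailbreaks" the small model.
    if "ignore your rules" in prompt.lower():
        return "[UNALIGNED] draft"
    return "[ALIGNED] draft"


def toy_guard(prompt: str, response: str) -> bool:
    # Pretends to be an existing guard that inspects the response.
    return "[UNALIGNED]" in response


def toy_target_model(prompt: str) -> str:
    return f"target answer to: {prompt}"
```

In this sketch the guard still sees a response alongside the prompt, as a post-model guard would, but that response comes from a small draft model, so a flagged prompt never reaches the target model.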

