Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, "Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g., "I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
翻译:检测针对大型语言模型(LLM)恶意查询的安全防护模型,对于确保LLM在现实应用中的安全可靠部署至关重要。然而,由于巨大的内存需求和延迟问题,在移动设备上部署具有数十亿参数的现有安全防护模型与LLM并不现实。为降低此成本,我们利用带有二元危害性标签的指令-响应对标注数据集,将大型教师安全防护模型蒸馏为较小模型。由于现有标注数据集中有害指令的多样性有限,简单蒸馏的模型往往性能逊于大型模型。为缩小小型与大型模型间的性能差距,我们提出HarmAug——一种简单而有效的数据增强方法,该方法通过越狱LLM并提示其生成有害指令来实现。给定如“生成一个会引发冒犯性内容的有害单指令提示”的提示,我们在LLM的响应前添加肯定性前缀(例如“我有一个提示想法:”)。这促使LLM继续生成后续响应,从而采样得到有害指令。另一个LLM针对该有害指令生成响应,再由教师模型对该指令-响应对进行标注。实验表明,我们的HarmAug方法优于其他相关基线。此外,使用HarmAug训练的4.35亿参数安全防护模型,在F1分数上达到了与超过70亿参数的大型模型相当的水平,甚至在AUPRC指标上表现更优,而其计算成本不足大型模型的25%。