HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, "Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g., "I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.

翻译：检测针对大型语言模型（LLM）恶意查询的安全防护模型，对于确保LLM在现实应用中的安全可靠部署至关重要。然而，由于巨大的内存需求和延迟问题，在移动设备上部署具有数十亿参数的现有安全防护模型与LLM并不现实。为降低此成本，我们利用带有二元危害性标签的指令-响应对标注数据集，将大型教师安全防护模型蒸馏为较小模型。由于现有标注数据集中有害指令的多样性有限，简单蒸馏的模型往往性能逊于大型模型。为缩小小型与大型模型间的性能差距，我们提出HarmAug——一种简单而有效的数据增强方法，该方法通过越狱LLM并提示其生成有害指令来实现。给定如“生成一个会引发冒犯性内容的有害单指令提示”的提示，我们在LLM的响应前添加肯定性前缀（例如“我有一个提示想法：”）。这促使LLM继续生成后续响应，从而采样得到有害指令。另一个LLM针对该有害指令生成响应，再由教师模型对该指令-响应对进行标注。实验表明，我们的HarmAug方法优于其他相关基线。此外，使用HarmAug训练的4.35亿参数安全防护模型，在F1分数上达到了与超过70亿参数的大型模型相当的水平，甚至在AUPRC指标上表现更优，而其计算成本不足大型模型的25%。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【CHI2020-微软】解释可解释性:理解数据科学家使用机器学习的可解释性工具，Interpreting Interpretability: Understanding Data Scientists’Use of Interpretability Tools for Machine Learning

专知会员服务

55+阅读 · 2020年3月8日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日