We introduce a low-resource safety enhancement method for aligning large language models (LLMs) that requires neither supervised fine-tuning (SFT) nor reinforcement learning from human feedback (RLHF). Our key idea is to use knowledge distillation to extract alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On a harmful-question dataset, our method significantly raises the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
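To make the delta-debugging step concrete, the following is a minimal sketch of the classic ddmin minimization loop, which shrinks a set of candidate knowledge components to a minimal subset that still preserves a target property (here an abstract `causes_effect` predicate; the component representation and the alignment test are assumptions, not the paper's actual implementation).

```python
def ddmin(components, causes_effect):
    """Simplified delta debugging (ddmin): shrink `components` to a
    minimal subset for which `causes_effect` still returns True.

    `components` is a list of candidate knowledge components;
    `causes_effect` is a caller-supplied predicate (e.g. "the distilled
    model still refuses harmful prompts with this subset")."""
    n = 2  # number of chunks to split into
    while len(components) >= 2:
        chunk = max(1, len(components) // n)
        subsets = [components[i:i + chunk]
                   for i in range(0, len(components), chunk)]
        reduced = False
        for subset in subsets:
            # Try removing one chunk: test on the complement.
            complement = [c for c in components if c not in subset]
            if causes_effect(complement):
                components = complement  # chunk was not needed; drop it
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(components):
                break  # already at single-element granularity
            n = min(len(components), n * 2)  # refine the split
    return components
```

For instance, if only components 3 and 5 are jointly responsible for the effect, `ddmin(list(range(8)), lambda cs: {3, 5} <= set(cs))` converges to `[3, 5]`.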