We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to use knowledge distillation to extract alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method raises the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
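To illustrate the delta debugging step, the following is a minimal sketch of the generic ddmin reduction loop, not the implementation used in this work. The abstract does not specify what a knowledge component is or how effectiveness is measured, so the function name `ddmin`, the predicate `still_effective`, and the component labels `c0` to `c7` are all hypothetical placeholders.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], still_effective: Callable[[List[T]], bool]) -> List[T]:
    """Reduce `items` to a smaller subset for which `still_effective` holds.

    Generic delta debugging (ddmin). `still_effective` is a hypothetical
    stand-in for whatever check decides that distillation still works.
    """
    current = list(items)
    if not still_effective(current):
        raise ValueError("the full component set must satisfy the predicate")

    n = 2  # granularity: number of chunks the current set is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Try to keep only this chunk.
            if still_effective(subset):
                current, n, reduced = subset, 2, True
                break
            # Otherwise try to drop this chunk. With two chunks the
            # complement is the other chunk, which is tested on its own.
            if len(subsets) > 2:
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if still_effective(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(current):
                break  # already at single-item granularity
            n = min(len(current), n * 2)  # refine the split and retry
    return current


# Toy usage: assume (hypothetically) that only c2 and c5 matter.
components = [f"c{i}" for i in range(8)]
critical = {"c2", "c5"}
result = ddmin(components, lambda kept: critical.issubset(kept))
print(result)
```

In this toy setting the predicate is a simple set-inclusion check. In practice it would be replaced by an evaluation of the distilled model, which is far more expensive, so the number of predicate calls would dominate the cost.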