This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and provides comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. We then propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), which diminishes the toxicity of LLMs within a few tuning steps using only a single instance. We further provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, demonstrating that previous methods such as SFT and DPO may merely suppress the activations of toxic parameters, whereas DINM mitigates the toxicity of the toxic parameters themselves to a certain extent, making permanent adjustments. We hope these insights shed light on future work on developing detoxifying approaches and on the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.