This paper investigates detoxifying Large Language Models (LLMs) with knowledge editing techniques. We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and is equipped with comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. Then, we propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), which diminishes the toxicity of LLMs within a few tuning steps using only a single instance. We further provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, demonstrating that previous methods such as SFT and DPO may merely suppress the activations of toxic parameters, while DINM mitigates the toxicity of the toxic parameters to a certain extent, making permanent adjustments. We hope these insights will shed light on future work on developing detoxifying approaches and on the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.