This paper investigates the use of knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with a variety of powerful attack prompts and provides comprehensive metrics for systematic evaluation. We conduct experiments comparing knowledge editing approaches with previous baselines, and the results indicate that knowledge editing has the potential to detoxify LLMs efficiently with limited impact on general performance. We then propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), which diminishes the toxicity of LLMs within a few tuning steps using only one instance. We further provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, demonstrating that previous methods such as SFT and DPO may merely suppress the activations of toxic parameters, whereas DINM mitigates the toxicity of the toxic parameters themselves to a certain extent, making permanent adjustments. We hope that these insights shed light on future work on developing detoxifying approaches and on understanding the underlying knowledge mechanisms of LLMs. Code and benchmark are available at https://github.com/zjunlp/EasyEdit.