Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, in which apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, a limited number of detoxification objectives, and a subset of languages.