Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning: it trains models to generate alternative, fictitious knowledge in place of undesired content. However, we find that CFT still underperforms other unlearning paradigms in some respects, and we identify two previously overlooked pitfalls behind this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias that inflates hallucination rates on unrelated domains. To diagnose these issues systematically, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. We further discuss the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.
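To make the knowledge-conflict pitfall concrete, the sketch below measures gradient agreement between two counterfactual training examples that give different fabricated answers to the same prompt. It is an illustrative toy only, not the RWKU+ diagnostic: the linear softmax model, the helper names (`ce_gradient`, `cosine`), and the use of cosine similarity as the conflict signal are all assumptions introduced here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_gradient(weights, x, target):
    # Flattened gradient of the cross-entropy loss with respect to the
    # weights of a toy linear softmax model (hypothetical stand-in for an LLM).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = softmax(logits)
    grad = []
    for k, p in enumerate(probs):
        err = p - (1.0 if k == target else 0.0)
        grad.extend(err * xi for xi in x)
    return grad

def cosine(u, v):
    # Cosine similarity between two flattened gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Three answer classes, two input features, zero-initialized weights.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]  # the same prompt representation for every example

g_a = ce_gradient(weights, x, target=0)  # counterfactual answer A
g_b = ce_gradient(weights, x, target=1)  # inconsistent counterfactual answer B
g_c = ce_gradient(weights, x, target=0)  # a second example agreeing with A

conflict = cosine(g_a, g_b)    # negative: the two updates pull against each other
agreement = cosine(g_a, g_c)   # 1.0: consistent targets give aligned updates
print(conflict, agreement)
```

In this toy setting, inconsistent targets for the same input yield a negative cosine similarity, whereas consistent targets yield perfectly aligned gradients. A real diagnostic would compute per-example gradients of the actual model, which this sketch does not attempt.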