Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.