Machine unlearning aims to remove specific content from trained models while preserving overall performance. However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, reveals that existing unlearning methods remain fundamentally fragile. A common explanation attributes this effect to topical relevance, but we find this account insufficient. Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, due to their alignment in representations and gradients with the forgotten content. Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning. This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.