Large language models (LLMs) can internalize private or harmful content, motivating unlearning methods that remove a forget set while preserving retained knowledge. However, forgetting updates often cause collateral degradation of retained knowledge, creating a persistent trade-off. Existing LLM unlearning methods are often heuristic, while theoretical approaches rely on offline feature constructions that do not capture the update-time forget-retain interaction in LLMs. To address this limitation, we aim to develop an LLM unlearning method that reduces the forget-retain trade-off with theoretical guarantees. We take a first-principles view, formalizing "no side effects" as local retain invariance under small parameter updates, and prove an equivalence under the optimizer-induced geometry: the retain loss is locally invariant if and only if the update direction is orthogonal to the subspace spanned by retain gradients. Based on this insight, we propose Geometric-disentanglement Unlearning (GU), a lightweight and theoretically grounded projection that plugs into existing gradient-based unlearning methods to mitigate forget-retain side effects. Experiments on TOFU, MUSE, and WMDP-cyber show that GU strengthens forgetting while reducing retain drift. When added to SimNPO, it achieves up to 62\% higher forgetting Extraction Strength (ES) and 31\% higher retain ES. Our code is open-sourced at https://github.com/Lemutisme/Geometric-Unlearning.
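The projection described above can be illustrated with a minimal sketch: given an unlearning (forget) gradient and a set of retain gradients, remove the component of the update that lies in the retain-gradient subspace, so that a first-order step leaves the retain loss unchanged. This is an illustrative reconstruction, not the paper's implementation; the function name and shapes are assumptions.

```python
import numpy as np

def project_out_retain(g_forget, retain_grads):
    """Project g_forget onto the orthogonal complement of the subspace
    spanned by the retain gradients (illustrative sketch of the geometric
    projection idea described in the abstract, not the authors' code).

    g_forget: (d,) array, the forget-update direction.
    retain_grads: list of (d,) arrays, gradients of the retain loss.
    """
    # Stack retain gradients as columns of a d x k matrix.
    R = np.stack(retain_grads, axis=1)
    # Orthonormal basis Q for span(R) via a QR decomposition.
    Q, _ = np.linalg.qr(R)
    # Subtract the component of g_forget that lies in span(R); the
    # remainder is orthogonal to every retain gradient, so to first
    # order the retain loss does not move along this direction.
    return g_forget - Q @ (Q.T @ g_forget)
```

After the projection, the returned direction has zero inner product with each retain gradient, which is exactly the local-invariance condition stated in the equivalence.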