As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting, where some knowledge remains insufficiently erased while other knowledge is over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup optimization problem: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation that focuses on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method that enables smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.
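The two weighting schemes sketched in the abstract can be illustrated in a few lines. The snippet below is a minimal sketch, not the released implementation: `dv_weights` assumes a KL-regularized DRO inner step whose Donsker-Varadhan dual reduces to an exponential (softmax-style) reweighting of per-sample losses with a hypothetical temperature `tau`, and `topk_weights` assumes a GroupDRO-style discrete approximation that places uniform weight on the highest-loss subset.

```python
import numpy as np

def dv_weights(losses, tau=1.0):
    """Continuous DV-dual-style weighting (sketch): w_i ∝ exp(loss_i / tau),
    smoothly emphasizing hard-to-unlearn (high-loss) samples."""
    z = np.asarray(losses, dtype=float) / tau
    z -= z.max()                      # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()                # normalize to a distribution

def topk_weights(losses, k):
    """Discrete GroupDRO-style weighting (sketch): uniform weight on the
    k highest-loss samples, zero elsewhere."""
    losses = np.asarray(losses, dtype=float)
    idx = np.argsort(losses)[-k:]     # indices of the k largest losses
    w = np.zeros_like(losses)
    w[idx] = 1.0 / k
    return w

# Per-sample unlearning losses for one batch (illustrative values).
losses = [0.2, 1.5, 0.9, 3.0]
w_soft = dv_weights(losses, tau=0.5)  # inner step, continuous variant
w_hard = topk_weights(losses, k=2)    # inner step, discrete variant

# Outer step would minimize the reweighted objective w · losses.
obj_soft = float(np.dot(w_soft, losses))
obj_hard = float(np.dot(w_hard, losses))
```

In both variants the inner step only produces sample weights, so either can be dropped into a standard training loop by replacing the uniform batch average with the weighted sum; the temperature `tau` (or subset size `k`) controls how aggressively the hardest samples dominate.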