Natively Unlearnable Large Language Models

Unlearning aims to remove the influence of specific training data sources, but this has proved challenging because the contributions of different sources are entangled within the model. Isolating each source's contribution in disjoint parameters makes removal easier, but it obstructs joint learning across sources. We propose NULLs (Natively Unlearnable LLMs), a model class that reconciles these two opposing goals, isolating source-specific contributions and learning jointly across sources, by training a set of shared backbone neurons alongside a pool of sparsely activated sinks. During training, information specific to a source naturally concentrates in its sinks, while information shared across sources accumulates in the backbone. A source is then unlearned at deployment by disabling its corresponding sinks, with no gradient updates and no access to the retained data. We show that NULLs scales to Wikipedia's ~6M articles, isolating each as an independent source. Unlearning a single article removes knowledge specific to it while preserving facts shared with semantically related articles, closely matching retraining from scratch. Unlearning with NULLs is also robust: in a case study of unlearning the Harry Potter books, NULLs resists both adversarial extraction and the relearning that reverses post-hoc unlearning. Finally, NULLs preserves general language capabilities, matching a standard transformer on downstream benchmarks. Together, these results suggest that source-level unlearning need not be an afterthought: it can be built natively into LLM training while retaining the benefits of shared representation learning.
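
The backbone-plus-sinks idea can be illustrated with a toy feed-forward layer. This is a minimal sketch under assumptions the abstract does not confirm: the class `SinkMLP`, the disjoint block assignment of sinks to sources, the choice to keep all non-disabled sinks active at deployment, and every parameter name are hypothetical, and the actual NULLs routing and training procedure may differ. It shows only that unlearning can reduce to flipping a mask, with no gradient step and no change to the weights.

```python
import numpy as np


class SinkMLP:
    """Toy layer: shared backbone neurons plus a pool of per-source sink neurons.

    Hypothetical illustration only; not the architecture from the paper.
    """

    def __init__(self, d_model, n_backbone, n_sources, sinks_per_source, seed=0):
        rng = np.random.default_rng(seed)
        self.n_backbone = n_backbone
        self.k = sinks_per_source
        self.n_sinks = n_sources * sinks_per_source
        n_hidden = n_backbone + self.n_sinks
        self.w_in = rng.normal(0.0, 0.1, (d_model, n_hidden))
        self.w_out = rng.normal(0.0, 0.1, (n_hidden, d_model))
        # Sinks switched off by unlearning requests.
        self.disabled = np.zeros(self.n_sinks, dtype=bool)

    def sinks_for(self, source_id):
        # Assumption: each source owns a disjoint, contiguous block of sinks.
        return slice(source_id * self.k, (source_id + 1) * self.k)

    def hidden_mask(self, source_id=None):
        sink_mask = np.zeros(self.n_sinks, dtype=bool)
        if source_id is None:
            # Assumed deployment mode: every sink is a candidate.
            sink_mask[:] = True
        else:
            # Assumed training mode: only the current source's sinks are active.
            sink_mask[self.sinks_for(source_id)] = True
        sink_mask &= ~self.disabled  # unlearned sinks never fire
        # Backbone neurons are always active.
        return np.concatenate([np.ones(self.n_backbone), sink_mask.astype(float)])

    def forward(self, x, source_id=None):
        hidden = np.maximum(x @ self.w_in, 0.0)  # ReLU
        return (hidden * self.hidden_mask(source_id)) @ self.w_out

    def unlearn(self, source_id):
        # No gradient update and no retained data: just disable the sinks.
        self.disabled[self.sinks_for(source_id)] = True


model = SinkMLP(d_model=8, n_backbone=16, n_sources=3, sinks_per_source=4)
x = np.ones((2, 8))
before = model.forward(x)
model.unlearn(1)  # drop source 1's contribution at deployment
after = model.forward(x)
```

In this sketch the backbone columns of `w_in` and `w_out` are untouched by `unlearn`, which mirrors the claim that shared knowledge survives while source-specific capacity is removed.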
