We expose a critical limitation in current approaches to machine unlearning in language models: despite the apparent success of unlearning algorithms, information about the forgotten data remains linearly decodable from internal representations. To systematically assess this discrepancy, we introduce an interpretable, information-theoretic framework for auditing unlearning using Partial Information Decomposition (PID). By comparing model representations before and after unlearning, we decompose the mutual information with the forgotten data into distinct components, formalizing the notions of unlearned and residual knowledge. Our analysis reveals that redundant information, shared across both models, constitutes residual knowledge that persists post-unlearning and correlates with susceptibility to known adversarial reconstruction attacks. Leveraging these insights, we propose a representation-based risk score that can guide abstention on sensitive inputs at inference time, providing a practical mechanism to mitigate privacy leakage. Our work introduces a principled, representation-level audit for unlearning, offering theoretical insight and actionable tools for safer deployment of language models.
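As a reading aid, the decomposition the abstract alludes to can be written in the standard Williams–Beer PID form; the symbols below are illustrative, not necessarily the paper's own notation: $F$ denotes the forgotten data, and $R_{\mathrm{orig}}$, $R_{\mathrm{unl}}$ the internal representations of the original and unlearned models. Under this reading, the redundant term is the residual knowledge and the unique information of the original model is the unlearned knowledge.

```latex
% Sketch of the Williams–Beer PID ansatz (requires amsmath);
% F, R_orig, R_unl are hypothetical symbols, not the paper's notation.
\begin{align*}
I(F;\, R_{\mathrm{orig}}, R_{\mathrm{unl}})
  &= \underbrace{\mathrm{Red}(F;\, R_{\mathrm{orig}}, R_{\mathrm{unl}})}_{\text{residual knowledge}}
   + \underbrace{\mathrm{Unq}(F;\, R_{\mathrm{orig}} \setminus R_{\mathrm{unl}})}_{\text{unlearned knowledge}} \\
  &\quad + \mathrm{Unq}(F;\, R_{\mathrm{unl}} \setminus R_{\mathrm{orig}})
   + \mathrm{Syn}(F;\, R_{\mathrm{orig}}, R_{\mathrm{unl}})
\end{align*}
```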
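The claims that forgotten data remains "linearly decodable" and that a "representation-based risk score" can drive abstention suggest a simple probing pipeline. The following is a minimal sketch under assumed interfaces, not the paper's implementation; the random stand-in hidden states, `risk_score`, and `THRESHOLD` are all hypothetical.

```python
# Hypothetical sketch: fit a linear probe on hidden states of the unlearned
# model to test whether forget-set membership is still linearly decodable,
# then reuse the probe's confidence as an abstention score at inference time.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: rows are pooled representations from the unlearned model;
# labels mark whether the input came from the forget set.
H = rng.normal(size=(200, 64))    # (n_examples, hidden_dim)
y = rng.integers(0, 2, size=200)  # 1 = forget-set example, 0 = retain-set

# Linear probe: accuracy above chance means forgotten information is still
# linearly decodable from the representations (residual knowledge).
probe = LogisticRegression(max_iter=1000).fit(H, y)
print("probe accuracy:", probe.score(H, y))

def risk_score(h: np.ndarray) -> float:
    """Probe confidence that a representation still encodes forgotten data."""
    return float(probe.predict_proba(h.reshape(1, -1))[0, 1])

THRESHOLD = 0.9  # illustrative abstention threshold, not a recommended value
h_new = rng.normal(size=64)
if risk_score(h_new) > THRESHOLD:
    print("abstain: residual-knowledge risk too high")
else:
    print("answer normally")
```

In this sketch the abstention decision is a fixed threshold on probe confidence; in practice the threshold would be calibrated against the measured susceptibility to reconstruction attacks that the abstract reports the redundancy term correlates with.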