There is a growing demand for efficient data removal, both to comply with regulations such as the GDPR and to mitigate the influence of biased or corrupted data. This has motivated the field of machine unlearning, which aims to eliminate the influence of specific data subsets without the cost of full retraining. In this work, we propose a statistical framework for machine unlearning with generic loss functions and establish theoretical guarantees. For squared loss in particular, we develop Unlearning Least Squares (ULS) and establish its minimax optimality for estimating the model parameter of the remaining data when only the pre-trained estimator, the forget samples, and a small subsample of the remaining data are available. Our results reveal that the estimation error decomposes into an oracle term and an unlearning cost determined by the forget proportion and the forget model bias. We further develop asymptotically valid inference procedures that do not require full retraining. Numerical experiments and real-data applications demonstrate that the proposed method achieves performance close to that of retraining while requiring substantially less data access.
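To make the squared-loss setting concrete, the following is a minimal sketch of how a pre-trained least-squares fit might be corrected using only the three inputs named above. It is not the ULS estimator itself, whose exact form is not given here. It relies on the normal-equation identity beta_remain = beta_full + (X_r^T X_r)^{-1} X_f^T (X_f beta_full - y_f), with the Gram matrix of the remaining data approximated from a subsample. The function name, its arguments, and the optional ridge term are all hypothetical choices made for illustration.

```python
import numpy as np


def unlearn_least_squares_sketch(beta_full, X_forget, y_forget, X_sub, n_remain, ridge=0.0):
    """Illustrative one-step correction of a pre-trained OLS fit.

    Assumption: beta_full is the ordinary least-squares fit on the full data
    (remaining + forget). This is a sketch, not the ULS estimator itself.
    """
    m, d = X_sub.shape
    # Approximate the Gram matrix of the remaining data by rescaling the
    # subsample Gram matrix (exact when the subsample is all remaining data).
    gram_remain = (n_remain / m) * (X_sub.T @ X_sub) + ridge * np.eye(d)
    # Residuals of the forget samples under the pre-trained fit.
    resid_forget = X_forget @ beta_full - y_forget
    # Remove the forget samples' contribution to the normal equations.
    return beta_full + np.linalg.solve(gram_remain, X_forget.T @ resid_forget)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_remain, n_forget, d = 2000, 400, 5
    beta_remain_true = rng.normal(size=d)
    # The forget samples follow a shifted coefficient vector (a forget-model bias).
    beta_forget_true = beta_remain_true + 0.5

    X_r = rng.normal(size=(n_remain, d))
    y_r = X_r @ beta_remain_true + rng.normal(size=n_remain)
    X_f = rng.normal(size=(n_forget, d))
    y_f = X_f @ beta_forget_true + rng.normal(size=n_forget)

    # Pre-trained estimator on the full data, and the retraining baseline.
    beta_full = np.linalg.lstsq(np.vstack([X_r, X_f]), np.concatenate([y_r, y_f]), rcond=None)[0]
    beta_retrain = np.linalg.lstsq(X_r, y_r, rcond=None)[0]

    # Unlearn using only a 10% subsample of the remaining data.
    idx = rng.choice(n_remain, size=n_remain // 10, replace=False)
    beta_hat = unlearn_least_squares_sketch(beta_full, X_f, y_f, X_r[idx], n_remain)

    print("distance of pre-trained fit from retraining:", np.linalg.norm(beta_full - beta_retrain))
    print("distance of corrected fit from retraining:  ", np.linalg.norm(beta_hat - beta_retrain))
```

When the subsample is the entire remaining data set, the correction reproduces the retrained least-squares fit exactly. With a smaller subsample, the only approximation is in the Gram matrix, which is one simple way to see why access to a small part of the remaining data can be enough.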