Machine unlearning strives to uphold data owners' right to be forgotten by enabling models to selectively forget specific data. Recent advances propose pre-computing and storing statistics extracted from second-order information and implementing unlearning through Newton-style updates. However, Hessian matrix operations are extremely costly, and prior works conduct unlearning only for the empirical risk minimizer under a convexity assumption, precluding their applicability to high-dimensional over-parameterized models and to models that have not converged. In this paper, we propose an efficient Hessian-free unlearning approach. The key idea is to maintain a statistical vector for each training sample, computed through an affine stochastic recursion that tracks the difference between the retrained and learned models. We prove that, under the same regularity conditions, our method outperforms state-of-the-art methods in terms of unlearning and generalization guarantees, deletion capacity, and time/storage complexity. By recollecting the stored statistics for the data to be removed, we develop an online unlearning algorithm that achieves near-instantaneous data removal, as it requires only vector addition. Experiments demonstrate that our scheme surpasses existing results by orders of magnitude in time/storage costs, with millisecond-level unlearning execution, while also improving test accuracy.
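To make the key idea concrete, the toy NumPy sketch below illustrates one plausible instantiation under stated assumptions; it is not the paper's verbatim algorithm. All function names (`train_with_statistics`, `hvp_fd`, `unlearn`) are hypothetical, and the exact recursion is an assumption: for each sample i we maintain a vector delta[i] approximating the leave-one-out model difference via a first-order affine recursion, with the Hessian-vector product replaced by a finite difference of gradients (a standard Hessian-free device), so no Hessian is ever formed or stored.

```python
# A minimal sketch (assumed form, not the paper's verbatim method) of
# per-sample statistic-vector maintenance for Hessian-free unlearning,
# on a toy L2-regularized logistic regression. For each sample i we track
# delta[i] ~ w_{-i} - w via the first-order affine recursion
#   delta <- (I - lr*H) delta + (lr/(n-1)) * (grad_i(w) - grad(w)),
# where H v is computed by a central difference of gradients.
import numpy as np

def grad_point(w, x, y, lam):
    """Gradient of the regularized logistic loss at one sample (x, y)."""
    p = 1.0 / (1.0 + np.exp(-y * (x @ w)))
    return -(1.0 - p) * y * x + lam * w

def grad_batch(w, X, Y, lam):
    """Full-batch gradient, averaged over samples."""
    return np.mean([grad_point(w, x, y, lam) for x, y in zip(X, Y)], axis=0)

def hvp_fd(w, X, Y, lam, v, eps=1e-5):
    """Hessian-vector product H v via a central difference of gradients,
    so no Hessian matrix is ever formed or stored."""
    return (grad_batch(w + eps * v, X, Y, lam)
            - grad_batch(w - eps * v, X, Y, lam)) / (2.0 * eps)

def train_with_statistics(X, Y, lam=1e-2, lr=0.1, epochs=100):
    """Gradient descent that maintains, alongside the model w, one
    statistic vector per training sample via the affine recursion."""
    n, d = X.shape
    w = np.zeros(d)
    delta = np.zeros((n, d))  # delta[i] approximates w_{-i} - w
    for _ in range(epochs):
        g = grad_batch(w, X, Y, lam)
        for i in range(n):
            # Contract by (I - lr*H), then inject sample i's influence.
            gi = grad_point(w, X[i], Y[i], lam)
            delta[i] += (-lr * hvp_fd(w, X, Y, lam, delta[i])
                         + (lr / (n - 1)) * (gi - g))
        w = w - lr * g  # the ordinary training step
    return w, delta

def unlearn(w, delta, i):
    """Near-instantaneous removal of sample i: one vector addition."""
    return w + delta[i]
```

The sketch only conveys the shape of the computation: all second-order information is consumed online during training, so at deletion time removing sample i reduces to w + delta[i], a single vector addition, matching the near-instantaneous removal claimed above. The per-step cost here is a naive O(n) Hessian-vector products and is not representative of the paper's reported efficiency.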