New data privacy laws such as the General Data Protection Regulation (GDPR) [1] allow individuals to ask that any of their personal information be erased from trained machine learning models, which has prompted research into unlearning data from models as a way to comply with these laws. In this regard, we consider several approximate unlearning strategies, based on four mechanisms, applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3], pre-trained with the Med3D framework [4], as the backbone architecture for segmentation. Taking the pre-trained model as a baseline, we evaluate each strategy on two types of subjects: retain and forget. We assess the strategies through their Dice similarity coefficient and mean absolute error (MAE) at two training horizons, 20 and 50 epochs. The results show that the Noisy Label strategy had the best overall trade-off, with a decrease of 93% on the forget set while maintaining 84% accuracy on the retain set after 50 epochs. All other strategies showed extreme levels of forgetting at the higher epoch count but also suffered catastrophic degradation of their retain-set performance. The results of this study provide a strict baseline of performance metrics for subject-level unlearning and give practitioners clear criteria for selecting a suitable strategy.
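As a rough illustration of the two evaluation metrics named above, the following minimal sketch computes a Dice similarity coefficient and an MAE for a pair of flattened segmentation masks. It is not the authors' evaluation code: the function names, the use of binary (single-class) masks, the convention that two empty masks score 1.0, and the choice to compute MAE voxel-wise on label values are all assumptions made for this example.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average absolute voxel-wise difference between two flattened volumes."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Toy four-voxel example (hypothetical data, not from MRBrainS18).
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print("Dice:", dice_coefficient(predicted, reference))
    print("MAE:", mean_absolute_error(predicted, reference))
```

Under this reading, a strategy that unlearns well would show a low Dice score (and high MAE) on forget subjects while keeping a high Dice score on retain subjects; a multi-class segmentation would apply the same Dice computation once per tissue label.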