A Comprehensive Study of Privacy Risks in Curriculum Learning

Training a machine learning model on data presented in a meaningful order, i.e., from easy to hard, has been shown to accelerate training and improve model performance. The key enabling technique is curriculum learning (CL), which has seen great success and has been deployed in areas such as image and text classification. Yet how CL affects the privacy of machine learning is unclear. Given that CL changes the way a model memorizes the training data, its influence on data privacy needs to be thoroughly evaluated. To fill this knowledge gap, we perform the first study of this question, using membership inference attacks (MIA) and attribute inference attacks (AIA) as two vectors to quantify the privacy leakage caused by CL. Our evaluation on nine real-world datasets with several attack methods (NN-based, metric-based, and label-only MIA, plus NN-based AIA) reveals new insights about CL. First, MIA becomes slightly more effective when CL is applied, but the impact is much more prominent for the subset of training samples ranked as difficult. Second, a model trained under CL is less vulnerable to AIA than to MIA. Third, existing defense techniques such as DP-SGD, MemGuard, and MixupMMD remain effective under CL, though DP-SGD has a significant impact on target model accuracy. Finally, based on our insights into CL, we propose a new MIA, termed Diff-Cali, which exploits difficulty scores for result calibration and is demonstrated to be effective against all CL methods and the normal training method. With this study, we hope to draw the community's attention to the unintended privacy risks of emerging machine-learning techniques and to encourage the development of new attack benchmarks and defense solutions.
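To make the terms above concrete, the following is a minimal sketch of three ideas: ordering samples from easy to hard, a metric-based MIA that thresholds the model's confidence, and a threshold adjusted by a per-sample difficulty score. It is not the Diff-Cali procedure, whose details are not given in this abstract. The function names, the linear threshold rule, and the toy numbers are all assumptions chosen for illustration.

```python
def curriculum_order(samples, difficulty):
    """Return samples sorted from easy (low score) to hard (high score)."""
    paired = sorted(zip(difficulty, samples), key=lambda p: p[0])
    return [sample for _, sample in paired]


def metric_mia(confidences, threshold):
    """Metric-based MIA: guess 'member' when confidence exceeds a fixed threshold."""
    return [c > threshold for c in confidences]


def difficulty_adjusted_mia(confidences, difficulty, base, slope):
    """Hypothetical variant: lower the threshold for harder samples.

    The linear rule base - slope * difficulty is an assumption made for
    this sketch, not the calibration used by the paper.
    """
    return [c > base - slope * d for c, d in zip(confidences, difficulty)]


def attack_accuracy(predictions, is_member):
    """Fraction of membership guesses that match the ground truth."""
    correct = sum(p == m for p, m in zip(predictions, is_member))
    return correct / len(is_member)


# Hand-picked toy values (not results from the paper).
order = curriculum_order(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.3])

confidences = [0.99, 0.70, 0.90, 0.40]   # model confidence on the true label
difficulty = [0.1, 0.9, 0.1, 0.9]        # easy, hard, easy, hard
is_member = [True, True, False, False]   # ground-truth membership

plain_acc = attack_accuracy(metric_mia(confidences, 0.8), is_member)
adjusted_acc = attack_accuracy(
    difficulty_adjusted_mia(confidences, difficulty, base=1.0, slope=0.5),
    is_member,
)
print(order, plain_acc, adjusted_acc)
```

In this toy setup a single global threshold misclassifies the hard member and the easy non-member, while the difficulty-dependent threshold separates them. The values were constructed to show the mechanism and say nothing about real attack performance.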
