In this evolving era of machine learning security, membership inference attacks have emerged as a potent threat to the confidentiality of sensitive data. In such an attack, an adversary aims to determine whether a particular data point was used to train a target model. This paper proposes a new method for gauging a data point's membership in a model's training set. Instead of correlating loss with membership, as is traditionally done, we leverage the observation that training examples generally receive higher confidence values when classified into their true class. During training, the model is fit to the training data and may struggle to generalize to unseen data; this asymmetry leads the model to achieve higher confidence on training examples, as it exploits the specific patterns and noise present in them. Our approach uses the confidence values produced by the model: these provide a probabilistic measure of the model's certainty in its predictions and can be used to infer the membership of a given data point. We also introduce a variant of our method that carries out the attack without knowing a data point's ground-truth (true) class, offering an advantage over existing label-dependent attack methods.
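The decision rule described above can be sketched as a simple threshold test on the model's softmax output. This is a minimal illustration, not the paper's exact procedure: the threshold value is arbitrary, and the assumption that the label-free variant thresholds the maximum confidence is ours.

```python
import numpy as np

def infer_membership(confidences: np.ndarray, true_label: int,
                     threshold: float = 0.9) -> bool:
    """Label-dependent variant: predict 'member' when the model's
    confidence in the point's true class exceeds a threshold."""
    return bool(confidences[true_label] >= threshold)

def infer_membership_label_free(confidences: np.ndarray,
                                threshold: float = 0.9) -> bool:
    """Label-free variant (illustrative assumption: use the maximum
    confidence, since training points tend to be classified with
    high certainty regardless of which class that is)."""
    return bool(np.max(confidences) >= threshold)

# Toy softmax outputs: a confidently classified point vs. an uncertain one.
member_like = np.array([0.02, 0.95, 0.03])
nonmember_like = np.array([0.40, 0.35, 0.25])

print(infer_membership(member_like, true_label=1))    # confident -> member
print(infer_membership_label_free(nonmember_like))    # uncertain -> non-member
```

In practice the threshold would be calibrated, for example on shadow-model outputs or held-out data, rather than fixed a priori.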