Why Train More? Effective and Efficient Membership Inference via Memorization

Membership Inference Attacks (MIAs) aim to identify specific data samples within the private training dataset of machine learning models, leading to serious privacy violations and other sophisticated threats. Many practical black-box MIAs require query access to the data distribution (the same distribution from which the private data is drawn) to train shadow models. By doing so, the adversary obtains models trained "with" or "without" samples drawn from the distribution, and analyzes the characteristics of the samples under consideration. The adversary is often required to train hundreds of shadow models to extract the signals needed for MIAs; this is the computational overhead of MIAs. In this paper, we propose that by strategically choosing the samples, MI adversaries can maximize their attack success while minimizing the number of shadow models. First, our motivational experiments suggest memorization as the key property explaining disparate sample vulnerability to MIAs. We formalize this through a theoretical bound that connects MI advantage with memorization. Second, we show sample complexity bounds that connect the number of shadow models needed for MIAs with memorization. Lastly, we confirm our theoretical arguments with comprehensive experiments: by utilizing samples with high memorization scores, the adversary can (a) significantly improve its efficacy regardless of the MIA used, and (b) reduce the number of shadow models by nearly two orders of magnitude compared to state-of-the-art approaches.
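To make the notion of a memorization score concrete, here is a minimal sketch of how one might estimate it from shadow models. The abstract does not say which definition the paper uses; this sketch assumes a common label-memorization estimate, namely the gap between how often a sample is classified correctly by models trained with it and by models trained without it. The function names, the input layout, and the toy data are all hypothetical and chosen only for illustration.

```python
def memorization_scores(correct, included):
    """Estimate a per-sample memorization score from shadow-model results.

    correct[m][i]  -- truthy if shadow model m predicts sample i's label correctly
    included[m][i] -- truthy if sample i was in shadow model m's training set

    Assumed definition (not taken from the abstract):
        score_i = P(correct | trained with i) - P(correct | trained without i)
    Returns None for a sample that lacks either "in" or "out" models.
    """
    n_models = len(correct)
    n_samples = len(correct[0])
    scores = []
    for i in range(n_samples):
        in_hits = [bool(correct[m][i]) for m in range(n_models) if included[m][i]]
        out_hits = [bool(correct[m][i]) for m in range(n_models) if not included[m][i]]
        if not in_hits or not out_hits:
            scores.append(None)  # the gap cannot be estimated for this sample
            continue
        scores.append(sum(in_hits) / len(in_hits) - sum(out_hits) / len(out_hits))
    return scores


def top_k_memorized(scores, k):
    """Return the indices of the k samples with the highest estimated score."""
    scored = [i for i, s in enumerate(scores) if s is not None]
    return sorted(scored, key=lambda i: scores[i], reverse=True)[:k]


# Toy data: 4 hypothetical shadow models, 2 samples.
# Sample 0 is always classified correctly, so it shows no gap.
# Sample 1 is correct only when it was in the training set.
correct = [[1, 1], [1, 0], [1, 1], [1, 0]]
included = [[1, 1], [1, 0], [0, 1], [0, 0]]

scores = memorization_scores(correct, included)
targets = top_k_memorized(scores, k=1)
```

In this toy setup the second sample is the one an adversary following the abstract's strategy would prioritize, since its behavior differs most between models trained with and without it.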
