Recent work has shown that machine learning models can leak sensitive information about their training data. This leakage is exposed through membership and attribute inference attacks. Although many attack strategies have been proposed, little effort has gone into formalizing these problems. We present a novel formalism that generalizes the membership and attribute inference attack setups previously studied in the literature and connects them to memorization and generalization. First, we derive a universal bound on the success rate of inference attacks and connect it to the generalization gap of the target model. Second, we study how much sensitive information the learning algorithm stores about its training set, and we derive bounds on the mutual information between the sensitive attributes and the model parameters. Experimentally, we illustrate the potential of our approach by applying it to both synthetic data and classification tasks on natural images. Finally, we apply our formalism to different attribute inference strategies, with which an adversary is able to recover the identity of writers in the PenDigits dataset.