This work investigates and evaluates multiple defense strategies against property inference attacks (PIAs), a class of privacy attacks on machine learning models. Given a trained machine learning model, PIAs aim to extract statistical properties of its underlying training data, e.g., to reveal the ratio of men to women in a medical training data set. While defense mechanisms against other privacy attacks, such as membership inference, have been studied extensively, this is the first work focusing on defending against PIAs. With the primary goal of developing a generic mitigation strategy against white-box PIAs, we propose a novel approach, property unlearning. Extensive experiments show that property unlearning is very effective at defending target models against specific adversaries, but it does not generalize, i.e., it does not protect against a whole class of PIAs. To investigate the reasons behind this limitation, we present the results of experiments with the explainable AI tool LIME. They show that state-of-the-art property inference adversaries with the same objective focus on different parts of the target model. We elaborate on this with a follow-up experiment, in which we use the visualization technique t-SNE to show how strongly statistical properties of the training data manifest in machine learning models. Based on this, we conjecture that post-training techniques like property unlearning might not suffice to provide the desired generic protection against PIAs. As an alternative, we investigate how simpler training data preprocessing methods, such as adding Gaussian noise to the images of a training data set, affect the success rate of PIAs. We conclude with a discussion of the different defense approaches, summarize the lessons learned, and provide directions for future work.
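To make the preprocessing idea concrete, the following is a minimal sketch of adding zero-mean Gaussian noise to an image before training. It is an illustration under stated assumptions, not the procedure used in this work: the function name `add_gaussian_noise`, the noise level `sigma`, the nested-list image representation, and the assumption that pixel intensities lie in [0, 1] are all hypothetical choices.

```python
import random


def add_gaussian_noise(image, sigma=0.1, seed=None):
    """Return a noisy copy of `image`.

    `image` is assumed to be a list of rows, each a list of pixel
    intensities in [0, 1]. `sigma` is a hypothetical noise level.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        # Add zero-mean Gaussian noise to each pixel, then clip the
        # result back into the valid intensity range [0, 1].
        noisy.append(
            [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        )
    return noisy


# Toy 2x3 grayscale "image" used only for illustration.
image = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]
noisy = add_gaussian_noise(image, sigma=0.1, seed=0)
```

In such a setup, the function would be applied to every image in the training set before the target model is trained. A larger `sigma` perturbs the data more strongly, which would be expected to trade model utility against how clearly a data set property remains visible to an adversary.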