The availability of large amounts of informative data is crucial for successful machine learning. In domains with sensitive information, however, releasing high-utility data that also protects the privacy of individuals has proven challenging. Despite progress in differential privacy and generative modeling for privacy-preserving data release, only a few approaches optimize for machine learning utility: most consider only statistical metrics on the data itself and do not explicitly preserve the loss metrics of the machine learning models that will subsequently be trained on the generated data. In this paper, we introduce a data release framework, 3A (Approximate, Adapt, Anonymize), that maximizes data utility for machine learning while preserving differential privacy. We also describe a specific implementation of this framework that uses mixture models to approximate, kernel-inducing points to adapt, and Gaussian differential privacy to anonymize a dataset, so that the resulting data is both privacy-preserving and of high utility. We present experimental evidence showing minimal discrepancy between the performance metrics of models trained on real versus privatized datasets when both are evaluated on held-out real data. We also compare our results with several privacy-preserving synthetic data generation models (such as differentially private generative adversarial networks) and report significant increases in classification performance metrics over state-of-the-art models. These favorable comparisons suggest that the presented framework is a promising research direction for increasing the utility of low-risk synthetic data release for machine learning.
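To make the anonymize step more concrete, the following is a minimal sketch of the standard Gaussian mechanism calibrated under Gaussian differential privacy (GDP): adding noise with standard deviation equal to sensitivity / mu to a statistic yields mu-GDP. It is not the implementation described in the abstract. The function names (`clip_l2`, `gdp_noise_scale`, `private_mean`), the choice of a clipped mean as the released statistic, and the replace-one notion of neighboring datasets are all assumptions made for illustration.

```python
import math
import random


def clip_l2(vec, bound):
    """Scale vec down so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= bound or norm == 0.0:
        return list(vec)
    scale = bound / norm
    return [x * scale for x in vec]


def gdp_noise_scale(sensitivity, mu):
    """Noise standard deviation for which the Gaussian mechanism is mu-GDP."""
    if sensitivity <= 0 or mu <= 0:
        raise ValueError("sensitivity and mu must be positive")
    return sensitivity / mu


def private_mean(records, clip_bound, mu, rng):
    """Release a noisy mean of L2-clipped records (illustrative only).

    Assumption: neighboring datasets differ by replacing one record, so
    the clipped mean moves by at most 2 * clip_bound / n in L2 norm.
    """
    n = len(records)
    if n == 0:
        raise ValueError("records must be non-empty")
    dim = len(records[0])
    clipped = [clip_l2(r, clip_bound) for r in records]
    mean = [sum(r[j] for r in clipped) / n for j in range(dim)]
    sigma = gdp_noise_scale(2.0 * clip_bound / n, mu)
    # Independent Gaussian noise on every coordinate of the statistic.
    return [m + rng.gauss(0.0, sigma) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: 1000 two-dimensional records.
    data = [[rng.gauss(1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(1000)]
    print(private_mean(data, clip_bound=3.0, mu=1.0, rng=rng))
```

A smaller `mu` gives a stronger privacy guarantee at the cost of more noise, and the noise scale shrinks as the number of records grows because the sensitivity of the mean is proportional to 1 / n.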