Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of using DP synthetic data in end-to-end machine learning pipelines matters in areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We study the impact of DP synthetic data on downstream classification tasks in terms of both utility and fairness. Our analysis covers representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings show that, for tabular data, marginal-based synthetic data generators surpass GAN-based ones in model training utility. Indeed, we show that models trained on data generated by marginal-based algorithms can exhibit utility similar to that of models trained on real data. Our analysis also reveals that models trained on data from the marginal-based generator MWEM PGM can simultaneously achieve utility and fairness characteristics close to those of models trained on real data.