Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data

Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic data set generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generator MWEM PGM can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
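
The abstract evaluates models on both utility and several fairness definitions but does not name the specific metrics. As a minimal sketch of how such an evaluation could be scored, the example below computes accuracy as the utility measure and two widely used group-fairness gaps, demographic parity difference and equal opportunity difference. The choice of these two metrics, the function names, and the toy labels, predictions, and binary group attribute are all assumptions made for illustration; they are not taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Utility: fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def positive_rate(y_pred, mask):
    """Share of positive predictions among the rows selected by mask.

    Assumes at least one row is selected; an empty selection would
    raise ZeroDivisionError.
    """
    selected = [p for p, m in zip(y_pred, mask) if m]
    return sum(selected) / len(selected)


def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rate between groups 0 and 1."""
    rate0 = positive_rate(y_pred, [g == 0 for g in group])
    rate1 = positive_rate(y_pred, [g == 1 for g in group])
    return abs(rate0 - rate1)


def equal_opportunity_difference(y_true, y_pred, group):
    """Absolute gap in true-positive rate between groups 0 and 1."""
    tpr0 = positive_rate(
        y_pred, [g == 0 and t == 1 for g, t in zip(group, y_true)]
    )
    tpr1 = positive_rate(
        y_pred, [g == 1 and t == 1 for g, t in zip(group, y_true)]
    )
    return abs(tpr0 - tpr1)


# Hypothetical held-out split. In a setting where no real data is
# available for testing, this split would itself be synthetic.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]  # binary protected attribute (assumed)

acc = accuracy(y_true, y_pred)
dp_gap = demographic_parity_difference(y_pred, group)
eo_gap = equal_opportunity_difference(y_true, y_pred, group)
print(acc, dp_gap, eo_gap)
```

On this toy split the two groups receive positive predictions at the same rate, so the demographic parity gap is zero, while the true-positive rates differ. This shows why an evaluation covering several fairness definitions can reach different conclusions about the same model.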

