CaPS: Collaborative and Private Synthetic Data Generation from Distributed Sources

Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. With growing interest in data, governments have taken important steps towards a regulated data world, drastically impacting data sharing and data usability and leaving massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing solution to break down these walls and enable data sharing, the main drawback of existing solutions is that they assume a trusted aggregator for generative model training. Given that many data holders may not want to, or may not be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general, applicable to any marginal-based SDG, and provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.
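To make the combination of input privacy (MPC) and output privacy (DP) more concrete, the following is a minimal single-process sketch of how a noisy one-way marginal might be computed from several data holders without any one party seeing raw counts. It is not the paper's protocol: the additive secret sharing scheme, the modulus `MODULUS`, the function names, and the rounded Gaussian noise are all assumptions chosen for illustration. In particular, the noise is sampled in one place here, whereas a real MPC protocol would sample it jointly, and rounded Gaussian noise is only a stand-in for a properly calibrated DP mechanism.

```python
import random
import secrets

# Hypothetical modulus for additive secret sharing (an assumption, not from the paper).
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split an integer into n additive shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares and map the result back to a signed integer."""
    total = sum(shares) % MODULUS
    return total - MODULUS if total > MODULUS // 2 else total


def local_marginal(rows, domain):
    """Count how often each domain value occurs in one holder's local data."""
    counts = {value: 0 for value in domain}
    for row in rows:
        counts[row] += 1
    return [counts[value] for value in domain]


def secure_noisy_marginal(holders, domain, n_parties=3, sigma=1.0, rng=None):
    """Simulate computing a noisy marginal over secret-shared local counts."""
    rng = rng or random.Random()
    # party_totals[p][cell] is the running share held by computing party p.
    party_totals = [[0] * len(domain) for _ in range(n_parties)]

    # Each data holder secret-shares its local counts among the computing parties.
    for rows in holders:
        for cell, count in enumerate(local_marginal(rows, domain)):
            for p, s in enumerate(share(count, n_parties)):
                party_totals[p][cell] = (party_totals[p][cell] + s) % MODULUS

    # Noise is added in shared form, so only the noisy total is ever revealed.
    # Simplification: a real protocol would sample this noise jointly inside MPC.
    for cell in range(len(domain)):
        noise = round(rng.gauss(0, sigma))
        for p, s in enumerate(share(noise, n_parties)):
            party_totals[p][cell] = (party_totals[p][cell] + s) % MODULUS

    # Only the final noisy marginal is reconstructed.
    return [
        reconstruct([party_totals[p][cell] for p in range(n_parties)])
        for cell in range(len(domain))
    ]
```

Calling `secure_noisy_marginal([["a", "b"], ["a", "a", "c"]], ["a", "b", "c"], sigma=2.0)` would return the combined counts for `a`, `b`, and `c` perturbed by integer noise. A marginal-based SDG algorithm would then fit its generative model to such noisy measurements rather than to the raw data.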
