Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. As interest in data has grown, governments have taken important steps toward regulating it, which has drastically affected data sharing and data usability and left massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing way to break down these walls and enable data sharing, the main drawback of existing solutions is that they assume a trusted aggregator for generative model training. Given that many data holders may not want to, or may not be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general and applicable to any marginal-based SDG algorithm. It provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols, and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.
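To make the combination of input privacy and output privacy more concrete, the following is a minimal single-process sketch, not the protocol proposed in the paper. It assumes additive secret sharing over a prime field as the MPC building block and rounded Gaussian noise as a stand-in for a DP mechanism. All names (`share`, `reconstruct`, `local_marginal`, `secure_noisy_marginal`), the modulus, and the `sigma` parameter are hypothetical. In particular, the noise here is sampled in the clear by the simulation and then shared, which a real deployment without a trusted party would have to replace with noise sampled inside the MPC protocol, and the noise is not calibrated to any privacy budget.

```python
import random
import secrets
from collections import Counter

PRIME = 2**61 - 1  # field modulus for additive sharing (illustrative choice)


def share(value, n_parties):
    """Split an integer into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares and map the result back to a signed integer."""
    total = sum(shares) % PRIME
    return total - PRIME if total > PRIME // 2 else total


def local_marginal(records, attrs, domain):
    """Count how often each value combination of `attrs` occurs in one holder's data."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    return [counts.get(cell, 0) for cell in domain]


def secure_noisy_marginal(holders, attrs, domain, n_parties=3, sigma=1.0, rng=None):
    """Simulate computing a noisy marginal over the union of all holders' data.

    Each holder secret-shares its local counts, so no computing party sees
    raw data or per-holder counts. Only the noisy total is opened.
    """
    rng = rng or random.Random()
    # party_sums[p][c] is party p's share of the running total for cell c.
    party_sums = [[0] * len(domain) for _ in range(n_parties)]

    for records in holders:
        for c, count in enumerate(local_marginal(records, attrs, domain)):
            for p, s in enumerate(share(count, n_parties)):
                party_sums[p][c] = (party_sums[p][c] + s) % PRIME

    # ASSUMPTION: noise is sampled here in the clear, for illustration only.
    # A real protocol would generate it obliviously inside MPC.
    for c in range(len(domain)):
        noise = round(rng.gauss(0, sigma))
        for p, s in enumerate(share(noise, n_parties)):
            party_sums[p][c] = (party_sums[p][c] + s) % PRIME

    # Open only the noisy totals.
    return [
        reconstruct([party_sums[p][c] for p in range(n_parties)])
        for c in range(len(domain))
    ]


if __name__ == "__main__":
    holder_a = [{"sex": "F", "smoker": "y"}, {"sex": "M", "smoker": "n"}]
    holder_b = [{"sex": "F", "smoker": "y"}, {"sex": "F", "smoker": "n"}]
    domain = [("F", "y"), ("F", "n"), ("M", "y"), ("M", "n")]
    print(secure_noisy_marginal([holder_a, holder_b], ["sex", "smoker"], domain))
```

In a select-measure-generate pipeline, noisy marginals of this kind would be the "measure" step, with the opened values passed on to the generation step. The sketch omits the selection step and any privacy accounting.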