Data is the lifeblood of the modern world, forming a fundamental part of AI, decision-making, and research advances. With growing interest in data, governments have taken important steps towards regulating its use, drastically impacting data sharing and data usability and leaving massive amounts of data confined within the walls of organizations. While synthetic data generation (SDG) is an appealing way to break down these walls and enable data sharing, the main drawback of existing solutions is that they assume a trusted aggregator for generative model training. Given that many data holders may not want to, or may not be legally allowed to, entrust a central entity with their raw data, we propose a framework for the collaborative and private generation of synthetic tabular data from distributed data holders. Our solution is general and applicable to any marginal-based SDG algorithm. It provides input privacy by replacing the trusted aggregator with secure multi-party computation (MPC) protocols, and output privacy via differential privacy (DP). We demonstrate the applicability and scalability of our approach for the state-of-the-art select-measure-generate SDG algorithms MWEM+PGM and AIM.
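To make the combination of input and output privacy concrete, the following is a minimal Python sketch of how a single marginal could be aggregated across data holders without a trusted aggregator. It is not the authors' protocol: the function names (`local_marginal`, `share`, `secure_sum_marginal`, `noisy_marginal`), the modulus, the number of servers, and the use of plain additive secret sharing are all assumptions made for illustration. In particular, the sketch adds Gaussian noise in the clear after reconstruction; in an actual deployment the noise would have to be sampled inside the MPC protocol so that no party ever sees the exact aggregate.

```python
import random
import secrets
from collections import Counter

MODULUS = 2**61 - 1  # size of the share ring (illustrative choice)


def local_marginal(rows, domain):
    """Count how often each value of `domain` occurs in one holder's rows."""
    counts = Counter(rows)
    return [counts.get(value, 0) for value in domain]


def share(value, n_parties):
    """Split an integer into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine additive shares into the underlying integer."""
    return sum(shares) % MODULUS


def secure_sum_marginal(holder_marginals, n_servers=3):
    """Holders secret-share their counts; each server only adds its shares."""
    n_cells = len(holder_marginals[0])
    server_totals = [[0] * n_cells for _ in range(n_servers)]
    for marginal in holder_marginals:
        for cell, count in enumerate(marginal):
            for server, piece in enumerate(share(count, n_servers)):
                server_totals[server][cell] = (
                    server_totals[server][cell] + piece
                ) % MODULUS
    # Only the aggregated marginal is reconstructed, never a holder's counts.
    return [
        reconstruct([server_totals[s][cell] for s in range(n_servers)])
        for cell in range(n_cells)
    ]


def noisy_marginal(true_counts, sigma, rng):
    """Add Gaussian noise per cell (done in the clear here, for illustration)."""
    return [count + rng.gauss(0.0, sigma) for count in true_counts]


if __name__ == "__main__":
    domain = ["a", "b", "c"]
    holders = [["a", "a", "b"], ["b", "c"], ["c", "c", "a"]]
    marginals = [local_marginal(rows, domain) for rows in holders]
    total = secure_sum_marginal(marginals)
    print(noisy_marginal(total, sigma=1.0, rng=random.Random(0)))
```

Each server sees only uniformly random shares, so no single server learns any holder's counts. A select-measure-generate algorithm would then consume noisy marginals of this kind in its measure step.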