Preserving individual privacy while enabling collaborative data sharing is crucial for organizations. Synthetic data generation is one solution, producing artificial data that mirrors the statistical properties of private data. While numerous techniques have been devised under differential privacy, they predominantly assume that the data is centralized. In practice, however, data is often distributed across multiple clients in a federated manner. In this work, we initiate the study of federated synthetic tabular data generation. Building upon AIM, a state-of-the-art central method, we present DistAIM and FLAIM. We first show that distributing AIM is straightforward by extending a recent approach based on secure multi-party computation, but that this requires additional overhead, making it less suited to federated scenarios. We then demonstrate that naively federating AIM can lead to substantial degradation in utility in the presence of heterogeneity. To mitigate both issues, we propose an augmented FLAIM approach that maintains a private proxy of heterogeneity. We simulate our methods across a range of benchmark datasets under different degrees of heterogeneity and show that we can improve utility while reducing overhead.