Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data, so they struggle in real-world scenarios where features are distributed across multiple silos and must remain stored on premises. We introduce SiloFuse, a novel generative framework for high-quality synthesis from cross-silo tabular data. To ensure privacy, SiloFuse uses a distributed latent tabular diffusion architecture: autoencoders learn latent representations of each client's features, masking their actual values. We employ stacked distributed training to improve communication efficiency, reducing the number of communication rounds to one. Under SiloFuse, we prove the impossibility of data reconstruction for vertically partitioned synthesis and quantify privacy risks through three attacks using our benchmark framework. Experimental results on nine datasets show that SiloFuse is competitive with centralized diffusion-based synthesizers. Notably, SiloFuse outperforms GANs by 43.8 and 29.8 percentage points in resemblance and utility, respectively. Communication experiments show that the cost of stacked training stays fixed, whereas the cost of end-to-end training grows with the number of training iterations. Additionally, SiloFuse proves robust to feature permutations and varying numbers of clients.
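The communication claim can be illustrated with a back-of-the-envelope cost model. The sketch below is not SiloFuse's implementation. It assumes a simplified accounting in which stacked training has each client upload its encoded latents exactly once, while end-to-end training exchanges a batch of latents and an equally sized gradient tensor in every iteration. The function names, latent widths, row count, batch size, and 4-byte values are all hypothetical choices made for illustration.

```python
def stacked_cost_bytes(n_rows, latent_dims, bytes_per_value=4):
    """Assumed model: each client sends its full latent matrix once,
    after training its autoencoder locally."""
    return sum(n_rows * d * bytes_per_value for d in latent_dims)


def end_to_end_cost_bytes(batch_size, latent_dims, iterations, bytes_per_value=4):
    """Assumed model: every iteration, each client sends one batch of
    latents forward and receives gradients of the same size back."""
    per_iteration = sum(2 * batch_size * d * bytes_per_value for d in latent_dims)
    return iterations * per_iteration


if __name__ == "__main__":
    latent_dims = [8, 8, 16]  # hypothetical latent width per client
    n_rows = 10_000           # hypothetical number of training rows
    batch_size = 256          # hypothetical batch size

    fixed = stacked_cost_bytes(n_rows, latent_dims)
    for iterations in (100, 1_000, 10_000):
        growing = end_to_end_cost_bytes(batch_size, latent_dims, iterations)
        print(f"iterations={iterations}: stacked={fixed} B, end-to-end={growing} B")
```

Under these assumptions the stacked cost does not depend on the iteration count, while the end-to-end cost scales linearly with it, which is the qualitative behavior the abstract reports.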