GTV: Generating Tabular Data via Vertical Federated Learning

Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption that the training data is directly accessible. Vertical Federated Learning (VFL) is a paradigm that allows a machine learning model to be trained in a distributed manner across clients that each hold distinct features of the same individuals, with tabular data learning as its primary use case. However, it is unknown whether tabular GANs can be learned in VFL. The need for secure data transfer between the clients and the GAN, during both training and data synthesis, poses an additional challenge. The conditional vector used by tabular GANs is a valuable tool for controlling specific features of the generated data, but it contains sensitive information from the real data and therefore puts privacy guarantees at risk. In this paper, we propose GTV, a VFL framework for tabular GANs whose key components are the generator, the discriminator, and the conditional vector. GTV proposes a unique distributed training architecture that lets the generator and discriminator access training data in a privacy-preserving manner. To incorporate the conditional vector into training without privacy leakage, GTV designs a mechanism, training-with-shuffling, which ensures that no party can reconstruct training data from the conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality and overall training scalability. Results show that GTV consistently generates high-fidelity synthetic tabular data of quality comparable to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients and with varying numbers of clients.
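To make the setting concrete, the following minimal sketch illustrates two notions mentioned above: a table partitioned vertically across clients, and a conditional vector. It is not GTV's implementation. The toy data, the client names, and the helper functions `vertical_split` and `conditional_vector` are hypothetical, and the one-hot encoding is an assumption borrowed from how conditional tabular GANs commonly encode a categorical condition. The training-with-shuffling mechanism itself is not shown.

```python
from typing import Dict, List

# Toy table (hypothetical data): each row describes one individual.
COLUMNS = ["age", "income", "gender", "loan"]
ROWS = [
    [34, 52000, "F", "yes"],
    [51, 61000, "M", "no"],
    [29, 38000, "F", "no"],
]


def vertical_split(
    columns: List[str],
    rows: List[list],
    assignment: Dict[str, List[str]],
) -> Dict[str, List[list]]:
    """Give each client only its own columns; row order stays aligned,
    so row i at every client refers to the same individual."""
    parts = {}
    for client, cols in assignment.items():
        idx = [columns.index(c) for c in cols]
        parts[client] = [[row[i] for i in idx] for row in rows]
    return parts


def conditional_vector(categories: List[str], value: str) -> List[int]:
    """One-hot vector marking the requested category of a single column
    (assumed encoding, for illustration only)."""
    return [1 if c == value else 0 for c in categories]


# Two hypothetical clients holding disjoint feature sets.
parts = vertical_split(
    COLUMNS,
    ROWS,
    {"client_a": ["age", "income"], "client_b": ["gender", "loan"]},
)

# Condition asking the generator for rows where loan == "yes".
cv = conditional_vector(["no", "yes"], "yes")
```

In this toy setup, a conditional vector built from a real row reveals which category that row holds, which gives an intuition for why sharing it across parties may be a privacy concern.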
