In the current artificial intelligence (AI) era, the scale and quality of a dataset play a crucial role in training a high-quality AI model. However, good data is not a free lunch and is often hard to access because of privacy regulations such as the General Data Protection Regulation (GDPR). A potential solution is to release a synthetic dataset whose distribution is similar to that of the private dataset. In some scenarios, however, the attributes needed to train an AI model belong to different parties, and privacy regulations prevent those parties from sharing raw data for synthetic data publication. At PETS 2023, Xue et al. proposed VertiGAN, the first generative adversarial network-based model for vertically partitioned data publication. After a thorough investigation, however, we found that VertiGAN is less effective at preserving the correlation among the attributes of different parties. To address this issue, this article proposes VFLGAN, a Vertical Federated Learning-based Generative Adversarial Network for vertically partitioned data publication. Our experimental results show that VFLGAN significantly improves the quality of synthetic data compared with VertiGAN. Taking the MNIST dataset as an example, the quality of the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN w.r.t. the Fréchet distance. We also design a more efficient and effective Gaussian mechanism for VFLGAN to provide the synthetic dataset with a differential privacy guarantee. However, differential privacy provides only a worst-case upper bound on privacy leakage. This article therefore also proposes a practical auditing scheme that applies membership inference attacks to estimate the privacy leakage through the synthetic dataset.
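As an illustration of the quality metric mentioned above, the following is a minimal sketch of a Fréchet distance between a real and a synthetic sample set. It assumes the common FID-style formulation: a Gaussian is fitted to each set and the squared distance between the two Gaussians is reported. The abstract does not state which features are compared or whether the squared form is used, so the function name `frechet_distance` and the choice of raw feature vectors as input are assumptions made for illustration only.

```python
import numpy as np


def frechet_distance(real: np.ndarray, synth: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two sample sets.

    Each input has shape (n_samples, n_features). This is a hypothetical
    helper, not the evaluation code of the paper.
    """
    mu_r, mu_s = real.mean(axis=0), synth.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synth, rowvar=False))

    # Tr((cov_r cov_s)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))
    # Stand-in for generator output: the real samples shifted by a constant.
    synth = real + 2.0
    print(frechet_distance(real, real))
    print(frechet_distance(real, synth))
```

Under this formulation, lower values indicate that the synthetic distribution is closer to the real one, and two identical sample sets give a distance of approximately zero.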