VFLGAN: Vertical Federated Learning-based Generative Adversarial Network for Vertically Partitioned Data Publication

In the current era of artificial intelligence (AI), the scale and quality of a dataset play a crucial role in training a high-quality model. However, good data is not free and is often hard to access because of privacy regulations such as the General Data Protection Regulation (GDPR). A potential solution is to release a synthetic dataset whose distribution is similar to that of the private dataset. In some scenarios, however, the attributes needed to train an AI model belong to different parties, and privacy regulations prevent those parties from sharing raw data for synthetic data publication. In PETS 2023, Xue et al. proposed VertiGAN, the first generative adversarial network (GAN)-based model for vertically partitioned data publication. After a thorough investigation, however, we found that VertiGAN is less effective at preserving the correlation among the attributes of different parties. This article proposes a Vertical Federated Learning-based Generative Adversarial Network, VFLGAN, for vertically partitioned data publication to address this issue. Our experimental results show that VFLGAN significantly improves the quality of synthetic data compared with VertiGAN. On the MNIST dataset, for example, the synthetic dataset generated by VFLGAN is 3.2 times better than that generated by VertiGAN with respect to the Fréchet distance. We also design a more efficient and effective Gaussian mechanism for VFLGAN to provide the synthetic dataset with a differential privacy guarantee. However, differential privacy gives only an upper bound on the worst-case privacy leakage. This article therefore also proposes a practical auditing scheme that applies membership inference attacks to estimate privacy leakage through the synthetic dataset.
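The abstract reports synthetic-data quality in terms of the Fréchet distance but does not say how it is computed. As a rough illustration only, the sketch below uses the standard closed form for the Fréchet distance between two Gaussians fitted to real and synthetic samples. The function name `frechet_distance`, the choice of raw feature vectors as input, and the toy data are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def frechet_distance(real: np.ndarray, synthetic: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two sample sets.

    Both inputs have shape (n_samples, n_features). This is the generic
    closed form; it is an assumption that the paper uses the same variant.
    """
    mu_r, mu_s = real.mean(axis=0), synthetic.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synthetic, rowvar=False))

    # Tr((cov_r cov_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # covariance matrices; clipping guards against tiny numerical negatives.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = float(np.sum((mu_r - mu_s) ** 2))
    return mean_term + float(np.trace(cov_r) + np.trace(cov_s)) - 2.0 * float(trace_sqrt)


# Toy data (hypothetical): one synthetic set close to the real one, one shifted.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
close = rng.normal(size=(2000, 4))
far = rng.normal(loc=1.0, size=(2000, 4))

print(frechet_distance(real, close))  # small value
print(frechet_distance(real, far))    # noticeably larger value
```

A lower value means the synthetic samples are statistically closer to the real ones, so a statement such as "3.2 times better" would correspond to a distance that is 3.2 times smaller under this reading.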

