There is a need for synthetic training and test datasets that replicate the statistical distributions of original datasets without compromising their confidentiality. Much research has leveraged Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or remain vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks, since the original data is used in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, a test generator (a fuzzer) produces test data from an input specification, preserving the constraints set by the original data, and a discriminator model determines how close the generated data is to the original data. By evolving samples and identifying "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to utility similar to that of the original data. We evaluated our approach on four datasets that have been used to evaluate state-of-the-art techniques. Our experiments highlight the potential of our approach for generating synthetic datasets that have high utility while preserving privacy.
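The generate-then-discriminate loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the specification format (`SPEC`), the helper names (`fuzz_sample`, `mutate`, `make_discriminator`, `evolve`), the hill-climbing search, and the use of per-column mean and standard deviation as the discriminator's notion of closeness are all assumptions made for illustration.

```python
import random
import statistics

# Hypothetical input specification: column name -> (min, max) constraint.
SPEC = {"age": (18.0, 90.0), "income": (0.0, 200_000.0)}


def fuzz_sample(spec, rng):
    """Generate one record uniformly at random within the spec's constraints."""
    return {col: rng.uniform(lo, hi) for col, (lo, hi) in spec.items()}


def mutate(sample, spec, rng, scale=0.1):
    """Perturb one randomly chosen column, clamping the result to the spec."""
    col = rng.choice(list(spec))
    lo, hi = spec[col]
    new = dict(sample)
    new[col] = min(hi, max(lo, sample[col] + rng.gauss(0.0, scale * (hi - lo))))
    return new


def make_discriminator(original):
    """Stand-in discriminator (an assumption, not a trained model).

    It keeps only per-column mean and standard deviation of the original
    data and scores a candidate dataset by how closely it matches them.
    Higher scores mean closer; 0.0 is a perfect match on these statistics.
    """
    stats = {}
    for col in original[0]:
        values = [row[col] for row in original]
        stats[col] = (statistics.mean(values), statistics.pstdev(values) or 1.0)

    def score(dataset):
        total = 0.0
        for col, (mean, std) in stats.items():
            values = [row[col] for row in dataset]
            total += abs(statistics.mean(values) - mean) / std
            total += abs(statistics.pstdev(values) - std) / std
        return -total

    return score


def evolve(spec, score, size=100, steps=2000, seed=0):
    """Hill-climb a fuzzed population, keeping mutations the discriminator does not score lower."""
    rng = random.Random(seed)
    population = [fuzz_sample(spec, rng) for _ in range(size)]
    best = score(population)
    for _ in range(steps):
        i = rng.randrange(size)
        previous = population[i]
        population[i] = mutate(previous, spec, rng)
        candidate_score = score(population)
        if candidate_score >= best:
            best = candidate_score
        else:
            population[i] = previous  # reject the mutation
    return population, best


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Toy "original" data, fabricated here purely to exercise the loop.
    original = [
        {"age": min(90.0, max(18.0, data_rng.gauss(40, 10))),
         "income": min(200_000.0, max(0.0, data_rng.gauss(50_000, 15_000)))}
        for _ in range(500)
    ]
    synthetic, final_score = evolve(SPEC, make_discriminator(original))
    print(len(synthetic), final_score)
```

In this sketch the generator never reads original records; only the discriminator's score guides the search, which mirrors the indirect use of original data described above. Matching two summary statistics is far weaker than a learned discriminator and carries no privacy guarantee by itself.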