Synthetic data generation is gaining popularity across computer vision applications. Existing state-of-the-art face recognition models are trained on large-scale face datasets crawled from the Internet, which raises privacy and ethical concerns. To address these concerns, several works have proposed generating synthetic face datasets for training face recognition models. However, these methods depend on generative models that are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study whether existing synthetic face recognition datasets leak information from the real data used to train the generator model. We conduct an extensive study of six state-of-the-art synthetic face recognition datasets and show that every one of these synthetic datasets leaks several samples from the original real dataset. To our knowledge, this is the first work to demonstrate leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study exposes privacy pitfalls in synthetic face recognition datasets and paves the way for future work on generating responsible synthetic face datasets.
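The abstract does not specify the mechanics of the attack, but a minimal sketch of one plausible similarity-based membership inference check is shown below, assuming embeddings extracted by any pretrained face recognition network (e.g., an ArcFace-style model). The function names, the cosine-similarity criterion, and the 0.7 threshold are illustrative assumptions for exposition, not the paper's actual method.

```python
# Hypothetical sketch: flag synthetic images whose nearest real training
# image is suspiciously similar in a face-embedding space, suggesting the
# generator may have memorized (leaked) that training sample.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n x d) and b (m x d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def find_leaked_samples(synthetic_embs: np.ndarray,
                        real_embs: np.ndarray,
                        threshold: float = 0.7):
    """Return (synthetic_idx, real_idx, similarity) triples for synthetic
    samples whose best-matching real training image exceeds the threshold.
    Embeddings are assumed to come from a pretrained face recognition model;
    the threshold value is an illustrative assumption."""
    sims = cosine_similarity(synthetic_embs, real_embs)  # (n_syn, n_real)
    nearest = sims.argmax(axis=1)                        # closest real image
    best = sims.max(axis=1)                              # its similarity
    leaked = np.where(best >= threshold)[0]
    return [(int(i), int(nearest[i]), float(best[i])) for i in leaked]

# Example with random embeddings standing in for real model outputs:
rng = np.random.default_rng(0)
syn = rng.normal(size=(100, 512))    # embeddings of synthetic images
real = rng.normal(size=(1000, 512))  # embeddings of real training images
print(find_leaked_samples(syn, real, threshold=0.7))
```

In practice, flagged pairs would be inspected visually or verified with a stronger matcher, since a high embedding similarity alone may also reflect two distinct but look-alike identities.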