Synthetic data generation is gaining popularity across computer vision applications. Existing state-of-the-art face recognition models are trained on large-scale face datasets crawled from the Internet, which raises privacy and ethical concerns. To address these concerns, several works have proposed generating synthetic face datasets for training face recognition models. However, these methods depend on generative models that are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to systematically study whether existing synthetic face recognition datasets leak information from the real data used to train the generator model. We conduct an extensive study of six state-of-the-art synthetic face recognition datasets and show that every one of these synthetic datasets leaks several samples from the original real dataset. To our knowledge, this is the first work to demonstrate leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study exposes privacy pitfalls in synthetic face recognition datasets and paves the way for future work on generating responsible synthetic face datasets.
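The abstract does not specify the mechanics of the attack, but a minimal sketch of one plausible similarity-based membership inference check is shown below, assuming embeddings extracted by any pretrained face recognition network (e.g., an ArcFace-style model). The function names, the cosine-similarity criterion, and the 0.7 threshold are illustrative assumptions for exposition, not the paper's actual method.

```python
# Hypothetical sketch: flag synthetic images whose nearest real training
# image is suspiciously similar in a face-embedding space, suggesting the
# generator may have memorized (leaked) that training sample.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n x d) and b (m x d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def find_leaked_samples(synthetic_embs: np.ndarray,
                        real_embs: np.ndarray,
                        threshold: float = 0.7):
    """Return (synthetic_idx, real_idx, similarity) triples for synthetic
    samples whose best-matching real training image exceeds the threshold.
    Embeddings are assumed to come from a pretrained face recognition model;
    the threshold value is an illustrative assumption."""
    sims = cosine_similarity(synthetic_embs, real_embs)  # (n_syn, n_real)
    nearest = sims.argmax(axis=1)                        # closest real image
    best = sims.max(axis=1)                              # its similarity
    leaked = np.where(best >= threshold)[0]
    return [(int(i), int(nearest[i]), float(best[i])) for i in leaked]

# Example with random embeddings standing in for real model outputs:
rng = np.random.default_rng(0)
syn = rng.normal(size=(100, 512))    # embeddings of synthetic images
real = rng.normal(size=(1000, 512))  # embeddings of real training images
print(find_leaked_samples(syn, real, threshold=0.7))
```

In practice, flagged pairs would be inspected visually or verified with a stronger matcher, since a high embedding similarity alone may also reflect two distinct but look-alike identities.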