Massively Annotated Datasets for Assessment of Synthetic and Real Data in Face Recognition

Face recognition applications have grown in parallel with the size of datasets, complexity of deep learning models and computational power. However, while deep learning models evolve to become more capable and computational power keeps increasing, the datasets available are being retracted and removed from public access. Privacy and ethical concerns are relevant topics within these domains. Through generative artificial intelligence, researchers have put efforts into the development of completely synthetic datasets that can be used to train face recognition systems. Nonetheless, the recent advances have not been sufficient to achieve performance comparable to the state-of-the-art models trained on real data. To study the drift between the performance of models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we conduct studies on the distribution of each attribute within all four datasets. Additionally, we further inspect the differences between real and synthetic datasets on the attribute set. When comparing through the Kullback-Leibler divergence we have found differences between real and synthetic samples. Interestingly enough, we have verified that while real samples suffice to explain the synthetic distribution, the opposite could not be further from being true.

翻译：人脸识别应用随着数据集规模、深度学习模型复杂度及计算能力的增长而同步发展。然而，尽管深度学习模型不断进化以增强能力，计算能力持续提升，可供使用的数据集却在陆续被撤销并从公共访问中移除。隐私与伦理问题已成为该领域的重要议题。研究者们借助生成式人工智能，致力于开发可用于训练人脸识别系统的完全合成数据集。然而，当前进展尚不足以在性能上媲美基于真实数据训练的最先进模型。为探究基于真实数据集与合成数据集训练的模型之间的性能差异，我们利用大规模属性分类器（MAC）为四个数据集（两个真实数据集、两个合成数据集）生成标注。基于这些标注，我们对四个数据集中各属性的分布展开研究，并进一步审视真实数据集与合成数据集在属性集上的差异。通过Kullback-Leibler散度进行对比时，我们发现真实样本与合成样本之间存在差异。有趣的是，我们验证得出：真实样本足以解释合成分布，但反之则远非如此。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日