Deep learning methods have achieved remarkable results in face recognition. One important way to improve performance is to collect and label as many images as possible. However, labeling identities and checking the quality of large image datasets is difficult, and mistakes are unavoidable when processing data at this scale. Previous work has addressed this problem only in the training domain, but mistakes in the gallery data used for face identification can cause far more serious problems. We propose gallery data sampling methods that are robust to outliers, including mislabeled, low-quality, and less-informative images, and that reduce search time. The proposed sampling-by-pruning and sampling-by-generating methods significantly improve face identification performance on our 5.4M web image dataset of celebrities. The proposed method achieves an FNIR of 0.0975 at FPIR=0.01, whereas the conventional method achieves 0.3891. The average number of feature vectors per individual gallery is reduced from 115.9 to 17.1, which allows much faster search. We also ran experiments on public datasets: our method achieves FNIRs of 0.1314 and 0.0668 at FPIR=0.01 on CASIA-WebFace and MS1MV2, whereas the conventional method achieves 0.5446 and 0.1327, respectively.
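For readers unfamiliar with the reported metric, the following is a minimal sketch of how FNIR at a fixed FPIR is commonly computed in open-set identification. It is not the evaluation code behind the numbers above. The function name `fnir_at_fpir` and its inputs are hypothetical: it assumes one score per mated probe (similarity to its true identity's gallery) and one score per non-mated probe (highest similarity to any gallery identity), and it ignores rank constraints.

```python
import numpy as np

def fnir_at_fpir(mated_scores, nonmated_top_scores, target_fpir=0.01):
    """Return (fnir, threshold) at the operating point FPIR <= target_fpir.

    Assumed inputs (illustrative, not taken from the abstract):
      mated_scores        : similarity of each mated probe to its true identity
      nonmated_top_scores : best gallery similarity of each non-mated probe
    A probe is accepted when its score is strictly greater than the threshold.
    """
    mated = np.asarray(mated_scores, dtype=float)
    nonmated = np.sort(np.asarray(nonmated_top_scores, dtype=float))[::-1]

    # Number of non-mated probes allowed to exceed the threshold.
    allowed = int(np.floor(target_fpir * nonmated.size + 1e-9))
    if allowed >= nonmated.size:
        threshold = -np.inf  # every non-mated probe may be accepted
    else:
        # At most `allowed` non-mated scores are strictly above this value.
        threshold = nonmated[allowed]

    # A mated probe is a miss when its true identity is not accepted.
    fnir = float(np.mean(mated <= threshold))
    return fnir, float(threshold)

# Toy example with synthetic scores.
nonmated_demo = np.arange(100) / 100.0      # 0.00, 0.01, ..., 0.99
mated_demo = [0.995, 0.5, 0.99, 0.97]
fnir_demo, thr_demo = fnir_at_fpir(mated_demo, nonmated_demo, target_fpir=0.01)
```

Under these assumptions, a lower FNIR at the same FPIR means fewer enrolled identities are missed while the false alarm rate on unenrolled probes stays fixed.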