Comparing Human and Machine Bias in Face Recognition

Samuel Dooley, Ryan Downing, George Wei, Nathan Shankar, Bradon Thymes, Gudrun Thorkelsdottir, Tiye Kurtz-Miott, Rachel Mattson, Olufemi Obiwumi, Valeriia Cherepanova, Micah Goldblum, John P Dickerson, Tom Goldstein

Much recent research has uncovered and discussed serious concerns about bias in facial analysis technologies, finding performance disparities between groups of people based on perceived gender, skin type, lighting condition, etc. These audits are immensely important and successful at measuring algorithmic bias, but they have two major limitations: the audits (1) use facial recognition datasets that lack quality metadata, like LFW and CelebA, and (2) do not compare the observed algorithmic bias to the biases of the human alternatives. In this paper, we release improvements to the LFW and CelebA datasets that will enable future researchers to obtain measurements of algorithmic bias that are not tainted by major flaws in the dataset (e.g. identical images appearing in both the gallery and test set). We also use these new data to develop a series of challenging facial identification and verification questions, which we administered to various algorithms and to a large, balanced sample of human reviewers. We find that both computer models and human survey participants perform significantly better at the verification task than at the identification task, generally obtain lower accuracy rates on dark-skinned or female subjects for both tasks, and obtain higher accuracy rates when their own demographics match those of the subject in the question. Computer models achieve higher accuracy than the survey participants on both tasks and exhibit bias to a similar degree as the human survey participants.
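
The dataset flaw mentioned above, identical images appearing in both the gallery and the test set, can be illustrated with a minimal sketch. This is not the authors' cleaning procedure: the function name `find_exact_overlap` and the dictionary layout (image id mapped to raw bytes) are hypothetical, and the check only catches byte-identical files. Re-encoded or resized copies would need a different technique, such as perceptual hashing.

```python
import hashlib
from collections import defaultdict


def find_exact_overlap(gallery, test):
    """Return sorted (gallery_id, test_id) pairs whose image bytes are identical.

    gallery, test: dicts mapping a (hypothetical) image id to raw image bytes.
    """
    # Index gallery images by the SHA-256 digest of their bytes.
    by_digest = defaultdict(list)
    for image_id, data in gallery.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(image_id)

    # Any test image whose digest is already in the index is a leaked duplicate.
    overlaps = []
    for image_id, data in test.items():
        digest = hashlib.sha256(data).hexdigest()
        for gallery_id in by_digest.get(digest, []):
            overlaps.append((gallery_id, image_id))
    return sorted(overlaps)


# Toy byte strings stand in for image files; they are illustrative only.
gallery = {"g1": b"\x00\x01", "g2": b"\x02\x03"}
test = {"t1": b"\x02\x03", "t2": b"\x09"}
overlap = find_exact_overlap(gallery, test)
```

An empty result from such a check would rule out only exact duplicates. It says nothing about near-duplicates or about the quality of the metadata.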
