Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

From arXiv. 9 pages, 2 tables, 1 figure. Position paper with empirical subgroup analysis highlighting the limitations of aggregate accuracy in fairness evaluation.

Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence shows that such systems often perform unevenly across demographic groups, producing disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. By analysing subgroup-level error distributions, including false positive rate (FPR) and false negative rate (FNR), it demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can have substantially different fairness profiles: subgroup error rates can vary significantly even when the single aggregate figure is nearly the same. The paper further examines the operational risks of accuracy-centric evaluation practices in law enforcement applications, where misclassification may lead to wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as the primary metric and to adopt more comprehensive evaluation frameworks for responsible AI deployment.
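The central claim, that equal aggregate accuracy can coexist with very different subgroup error rates, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the groups "A" and "B", the helper names `make_records`, `accuracy`, and `subgroup_rates`, and the synthetic counts are hypothetical and are not taken from the paper's data or method.

```python
from collections import defaultdict

def make_records(group, tp, fn, tn, fp):
    """Build synthetic (group, y_true, y_pred) triples; 1 = match, 0 = non-match."""
    return ([(group, 1, 1)] * tp + [(group, 1, 0)] * fn
            + [(group, 0, 0)] * tn + [(group, 0, 1)] * fp)

def accuracy(records):
    """Aggregate accuracy over all records, ignoring group membership."""
    return sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)

def subgroup_rates(records):
    """Per-group false positive rate (FPR) and false negative rate (FNR)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # "t"/"f": prediction correct or not; "p"/"n": predicted match or non-match.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        counts[group][key] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # true non-matches in this group
        positives = c["fn"] + c["tp"]  # true matches in this group
        rates[group] = {
            "fpr": c["fp"] / negatives if negatives else float("nan"),
            "fnr": c["fn"] / positives if positives else float("nan"),
        }
    return rates

# Hypothetical data: 50 true matches and 50 true non-matches per group.
# System 1 spreads its 20 errors evenly; system 2 concentrates them in group B.
system_1 = make_records("A", 45, 5, 45, 5) + make_records("B", 45, 5, 45, 5)
system_2 = make_records("A", 49, 1, 49, 1) + make_records("B", 41, 9, 41, 9)

for name, records in [("system_1", system_1), ("system_2", system_2)]:
    print(name, "accuracy:", accuracy(records), "rates:", subgroup_rates(records))
```

In this synthetic setup both systems reach an aggregate accuracy of 0.90, yet the second has an FPR and FNR of 0.02 for group A against 0.18 for group B. Because the sketch needs only ground-truth labels, predictions, and group membership, it is model-agnostic in the sense the abstract describes and could in principle be applied to logged outputs of a deployed system.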
