Machine learning models deployed in high-stakes applications, such as recidivism prediction and automated personnel selection, often exhibit systematic performance disparities across sensitive subpopulations, raising critical concerns about algorithmic bias. Fairness auditing addresses these risks through two primary functions: certification, which verifies adherence to fairness constraints, and flagging, which isolates specific demographic groups experiencing disparate treatment. Existing auditing techniques, however, are frequently limited by restrictive distributional assumptions or prohibitive computational overhead. We propose a novel framework based on empirical likelihood (EL) that constructs robust statistical measures of model performance disparities. Unlike traditional methods, our approach is non-parametric: the proposed disparity statistics are asymptotically chi-square or mixed chi-square distributed, ensuring valid inference without assumptions on the underlying data distribution. The framework is formulated as a constrained (profile) optimization problem that admits stable numerical solutions, facilitating both large-scale certification and efficient subpopulation discovery. Empirically, the EL methods outperform bootstrap-based approaches, yielding coverage rates closer to nominal levels while reducing computational latency by several orders of magnitude. We demonstrate the practical utility of the framework on the COMPAS dataset, where it successfully flags intersectional biases, identifying a significantly higher positive prediction rate for African-American males under 25 and a systematically lower positive prediction rate for Caucasian females relative to the population mean.
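The abstract invokes the chi-square calibration of EL disparity statistics. As a rough, self-contained illustration of that idea only, and not the paper's actual procedure, the sketch below tests whether one subgroup's positive prediction rate equals a benchmark value using Owen-style empirical likelihood for a mean. The function names (`el_log_ratio`, `disparity_audit`), the `brentq` root search for the Lagrange multiplier, and the simulated data are assumptions introduced here purely for illustration.

```python
# Minimal sketch (assumed, not the paper's implementation) of an empirical
# likelihood (EL) ratio test for one subgroup's positive prediction rate.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def el_log_ratio(x, mu0):
    """-2 * log empirical likelihood ratio for H0: E[X] = mu0 (Owen's EL for a mean)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - mu0
    if d.max() <= 0 or d.min() >= 0:
        # mu0 at or outside the data's convex hull: handled conservatively in this sketch.
        return np.inf
    # The Lagrange multiplier lies in the interval where every implied weight is in (0, 1).
    lo = (1.0 / n - 1.0) / d.max()
    hi = (1.0 / n - 1.0) / d.min()
    eps = 1e-9 * (hi - lo)
    # The estimating equation sum_i d_i / (1 + lam * d_i) = 0 is strictly decreasing in lam.
    g = lambda lam: np.sum(d / (1.0 + lam * d))
    lam = brentq(g, lo + eps, hi - eps)
    return 2.0 * np.sum(np.log1p(lam * d))

def disparity_audit(group_preds, benchmark_rate, alpha=0.05):
    """Test H0: the subgroup's positive prediction rate equals benchmark_rate."""
    stat = el_log_ratio(group_preds, benchmark_rate)
    pval = chi2.sf(stat, df=1)              # asymptotic chi-square(1) calibration
    return stat, pval, pval < alpha         # flag the subgroup if H0 is rejected

# Hypothetical usage: flag a subgroup whose positive prediction rate exceeds the overall rate.
rng = np.random.default_rng(0)
overall_rate = 0.45                          # assumed population-level positive rate
subgroup_preds = rng.binomial(1, 0.58, 400)  # simulated binary predictions for one subgroup
stat, pval, flagged = disparity_audit(subgroup_preds, overall_rate)
print(f"-2 log EL ratio = {stat:.2f}, p = {pval:.4f}, flagged = {flagged}")
```

Under this setup, the reported statistic is compared against a chi-square(1) quantile, mirroring the non-parametric calibration described above; the multivariate and mixed chi-square cases in the abstract would require the full constrained-optimization formulation rather than this one-dimensional special case.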