Before deploying a black-box model in high-stakes settings, it is important to evaluate the model's performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates, or to certify that no such groups exist. In this paper, we frame this task, often referred to as "fairness auditing," in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance and to certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich (even infinite) collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance over certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.
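To make the idea of simultaneously bounding group-wise disparities concrete, the following is a minimal sketch of a generic max-deviation bootstrap band. It is not the procedure developed in the paper: the function names, the choice of disparity (group mean loss minus overall mean loss), the tolerance `tol`, and the synthetic data are all assumptions made for illustration, and the band is unstudentized, so small groups will tend to dominate its width.

```python
import numpy as np

def group_disparities(loss, groups):
    """Mean loss within each group minus the overall mean loss.

    loss:   float array of shape (n,)
    groups: 0/1 float array of shape (n_groups, n), membership indicators
    Assumes every group has at least one member in the sample.
    """
    counts = groups.sum(axis=1)
    return (groups @ loss) / counts - loss.mean()

def simultaneous_bounds(loss, groups, alpha=0.1, n_boot=2000, seed=0):
    """Two-sided band that covers all group disparities jointly
    (approximately, via the nonparametric bootstrap)."""
    rng = np.random.default_rng(seed)
    n = loss.shape[0]
    est = group_disparities(loss, groups)
    max_dev = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        boot = group_disparities(loss[idx], groups[:, idx])
        # largest absolute deviation across groups in this resample
        max_dev[b] = np.max(np.abs(boot - est))
    q = np.quantile(max_dev, 1 - alpha)  # one critical value for all groups
    return est, est - q, est + q

# Hypothetical synthetic data: group A has an elevated error rate.
rng = np.random.default_rng(0)
n = 2000
in_a = np.arange(n) < 500
loss = (rng.random(n) < np.where(in_a, 0.5, 0.1)).astype(float)
groups = np.stack([in_a, ~in_a, rng.random(n) < 0.5]).astype(float)

est, lower, upper = simultaneous_bounds(loss, groups)
tol = 0.05                 # hypothetical disparity tolerance
flagged = lower > tol      # disparity exceeds tol with joint confidence
certified = upper < tol    # disparity is below tol with joint confidence
```

Because a single critical value `q` is shared by every group, the flag and certify decisions hold jointly across groups at roughly level `1 - alpha`, rather than one group at a time.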