Before deploying a black-box model in high-stakes settings, it is important to evaluate the model's performance on sensitive subpopulations. For example, in a recidivism prediction task, we may wish to identify demographic groups for which our prediction model has unacceptably high false positive rates, or to certify that no such groups exist. In this paper, we frame this task, often referred to as "fairness auditing," in terms of multiple hypothesis testing. We show how the bootstrap can be used to simultaneously bound performance disparities over a collection of groups with statistical guarantees. Our methods can be used to flag subpopulations affected by model underperformance and to certify subpopulations for which the model performs adequately. Crucially, our audit is model-agnostic and applicable to nearly any performance metric or group fairness criterion. Our methods also accommodate extremely rich, even infinite, collections of subpopulations. Further, we generalize beyond subpopulations by showing how to assess performance under certain distribution shifts. We test the proposed methods on benchmark datasets in predictive inference and algorithmic fairness and find that our audits can provide interpretable and trustworthy guarantees.
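To make the idea of simultaneously bounding group disparities concrete, here is a minimal sketch of one generic way a bootstrap can produce simultaneous bounds over a finite collection of groups. It is not the authors' procedure: the function name `simultaneous_disparity_bounds`, the definition of disparity as group mean loss minus overall mean loss, the unstudentized max-deviation statistic, the tolerance `tol`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def simultaneous_disparity_bounds(loss, groups, alpha=0.1, n_boot=2000, seed=0):
    """Illustrative simultaneous bootstrap bounds (hypothetical helper).

    loss:   (n,) per-example losses of the audited model.
    groups: (n, k) boolean matrix; groups[i, j] is True if example i is in group j.
    Returns (point, lower, upper), each of shape (k,).
    """
    rng = np.random.default_rng(seed)
    loss = np.asarray(loss, dtype=float)
    groups = np.asarray(groups, dtype=bool)
    n = loss.shape[0]

    def disparities(idx):
        l, g = loss[idx], groups[idx]
        counts = g.sum(axis=0)
        # Group mean loss minus overall mean loss; an empty group yields NaN.
        with np.errstate(invalid="ignore", divide="ignore"):
            group_mean = (g * l[:, None]).sum(axis=0) / counts
        return group_mean - l.mean()

    point = disparities(np.arange(n))

    # Bootstrap the largest absolute deviation across all groups at once,
    # so that a single critical value covers every group simultaneously.
    max_dev = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        max_dev[b] = np.nanmax(np.abs(disparities(idx) - point))

    crit = np.quantile(max_dev, 1 - alpha)
    return point, point - crit, point + crit


# Synthetic example (assumed data): group 0 has a higher error rate than group 1.
rng = np.random.default_rng(1)
n = 2000
in_a = rng.random(n) < 0.3
loss = (rng.random(n) < np.where(in_a, 0.6, 0.2)).astype(float)
groups = np.column_stack([in_a, ~in_a])

point, lower, upper = simultaneous_disparity_bounds(loss, groups)

tol = 0.05                 # assumed tolerance on the disparity
flagged = lower > tol      # disparity exceeds tol, even at the lower bound
certified = upper < tol    # disparity is below tol, even at the upper bound
```

Because the critical value comes from the maximum deviation over all groups, every interval holds at once with approximate coverage 1 - alpha. A known weakness of this unstudentized version is that the smallest group tends to determine the width for every group, so a practical audit would likely rescale by group size or variance.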