In fairness audits, a standard objective is to detect whether a given algorithm performs substantially differently across subgroups. Adequately powering the statistical analysis of such audits is crucial for obtaining informative fairness assessments, because it ensures a high probability of detecting unfairness when it exists. However, little guidance is available on how much data a fairness audit requires: directly applicable results for commonly used fairness metrics are lacking, and unequal subgroup sample sizes have not been considered. In this tutorial, we address these gaps by providing guidance on how to determine the subgroup sample sizes required to maximize the statistical power of hypothesis tests for detecting unfairness. Our findings apply to audits of binary classification models and to multiple fairness metrics derived as summaries of the confusion matrix. We also discuss other aspects of audit study design that can increase the reliability of audit results.
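As a rough illustration of what powering such an audit involves, the sketch below approximates the power of a two-sided z-test for a difference in one confusion-matrix rate (for example, the true positive rate) between two subgroups of unequal size. It is not the procedure developed in the tutorial: the unpooled normal approximation, the function names `two_proportion_power` and `smallest_minority_n`, and the parameter `ratio` are all assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided z-test of H0: p1 == p2.

    Assumption: normal approximation with unpooled variance, and
    rates strictly between 0 and 1. p1 and p2 are the subgroup rates
    under the alternative; n1 and n2 are the numbers of cases in each
    subgroup that enter the rate's denominator (for a true positive
    rate, the actual positives in that subgroup).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    shift = abs(p1 - p2) / se
    # Probability that the test statistic falls in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def smallest_minority_n(p1, p2, ratio, target=0.8, alpha=0.05, n_max=100_000):
    """Smallest n2 (with n1 = ratio * n2) whose approximate power
    reaches the target; returns None if n_max is exhausted.
    `ratio` is a hypothetical allocation parameter for this sketch."""
    for n2 in range(2, n_max):
        if two_proportion_power(p1, p2, ratio * n2, n2, alpha) >= target:
            return n2
    return None


if __name__ == "__main__":
    # Hypothetical scenario: rates of 0.80 vs 0.70, majority group 4x larger.
    n2 = smallest_minority_n(0.80, 0.70, ratio=4)
    print("minority-group n:", n2, "majority-group n:", 4 * n2)
```

Under this approximation, when the two rates are equal the computed power reduces to the significance level `alpha`, which is a useful sanity check on any power routine of this kind.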