Determining whether an algorithmic decision-making system discriminates against a specific demographic typically involves comparing a single point estimate of a fairness metric against a predefined threshold. This practice is statistically brittle: it ignores sampling error and treats small demographic subgroups the same as large ones. The problem intensifies in intersectional analyses, where multiple sensitive attributes are considered jointly, giving rise to a larger number of smaller groups. As these groups become more granular, the data representing them become too sparse for reliable estimation, and fairness metrics yield excessively wide confidence intervals, precluding meaningful conclusions about potential unfair treatment. In this paper, we introduce a unified, size-adaptive hypothesis-testing framework that turns fairness assessment into an evidence-based statistical decision. Our contribution is twofold. (i) For sufficiently large subgroups, we prove a Central Limit result for the statistical parity difference, leading to analytic confidence intervals and a Wald test whose type-I (false-positive) error rate is guaranteed at level $\alpha$. (ii) For the long tail of small intersectional groups, we derive a fully Bayesian Dirichlet-multinomial estimator; its Monte Carlo credible intervals are calibrated for any sample size and naturally converge to the Wald intervals as more data become available. We validate our approach empirically on benchmark datasets, demonstrating how our tests provide interpretable, statistically rigorous decisions under varying degrees of data availability and intersectionality.
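To make the two regimes concrete, the following is a minimal sketch and not the paper's implementation. It assumes the statistical parity difference (SPD) is the positive-outcome rate of a subgroup minus that of the remaining population, uses the textbook two-proportion Wald interval for large groups, and uses a Dirichlet posterior with a uniform prior over the four (membership, outcome) cells for small groups. The function names `wald_spd` and `bayes_spd`, the prior strength, and the number of draws are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def wald_spd(k_g, n_g, k_r, n_r, alpha=0.05):
    """Wald interval and two-sided test of H0: SPD = 0.

    k_g / n_g: positive outcomes and size of the subgroup,
    k_r / n_r: the same for the rest of the population.
    """
    p_g, p_r = k_g / n_g, k_r / n_r
    spd = p_g - p_r
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(p_g * (1 - p_g) / n_g + p_r * (1 - p_r) / n_r)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (spd - z_crit * se, spd + z_crit * se)
    if se == 0:
        # Degenerate case: both rates are exactly 0 or 1.
        p_value = 1.0 if spd == 0 else 0.0
    else:
        p_value = 2 * (1 - NormalDist().cdf(abs(spd) / se))
    return spd, ci, p_value


def bayes_spd(k_g, n_g, k_r, n_r, alpha=0.05, prior=1.0, draws=20000, seed=0):
    """Monte Carlo credible interval for SPD under a Dirichlet posterior.

    Assumption: a symmetric Dirichlet(prior) over the four cells
    (group, Y=1), (group, Y=0), (rest, Y=1), (rest, Y=0).
    """
    rng = random.Random(seed)
    counts = [k_g, n_g - k_g, k_r, n_r - k_r]
    samples = []
    for _ in range(draws):
        # Independent Gamma draws are an unnormalised Dirichlet sample;
        # the normalising constant cancels in the conditional rates below.
        g = [rng.gammavariate(c + prior, 1.0) for c in counts]
        samples.append(g[0] / (g[0] + g[1]) - g[2] / (g[2] + g[3]))
    samples.sort()
    lo = samples[int((alpha / 2) * draws)]
    hi = samples[int((1 - alpha / 2) * draws) - 1]
    return sum(samples) / draws, (lo, hi)


if __name__ == "__main__":
    # Large subgroup: analytic interval and p-value.
    print(wald_spd(800, 2000, 1000, 2000))
    # Small intersectional subgroup: posterior interval instead.
    print(bayes_spd(3, 5, 400, 1000))
```

With a handful of observations the posterior interval is wide and stays inside $[-1, 1]$, whereas the Wald interval can be unreliable; with thousands of observations per group the two intervals nearly coincide, which is the convergence behaviour the abstract describes.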