Disaggregated evaluation across subgroups is critical for assessing the fairness of machine learning models, but its uncritical use can mislead practitioners. We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of the relevant populations but reflective of real-world disparities. Furthermore, when data are not representative due to selection bias, both disaggregated evaluation and alternative approaches based on conditional independence testing may be invalid without explicit assumptions regarding the bias mechanism. We use causal graphical models to characterize fairness properties and metric stability across subgroups under different data generating processes. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift, including conditional independence testing and weighted performance estimation. These findings have broad implications for how practitioners design and interpret model assessments given the ubiquity of disaggregated evaluation.
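The weighted performance estimation mentioned above can be illustrated with a minimal inverse-probability-weighting sketch. This is not the paper's implementation; it assumes a simplified setting where each example's selection probability into the evaluation set is known, and all function and variable names are illustrative.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred, selection_prob):
    """IPW estimate of population accuracy when examples enter the
    evaluation sample with known, unequal selection probabilities.
    Weighting each observed example by 1/p_i corrects for the
    over-representation of easily-sampled examples."""
    w = 1.0 / np.asarray(selection_prob, dtype=float)
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    return float(np.sum(w * correct) / np.sum(w))

# Toy example: group A (first 4 examples) is oversampled (p = 0.8),
# group B (last 2 examples) is undersampled (p = 0.2).
y_true = [1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 0]          # A: 3/4 correct, B: 1/2 correct
p      = [0.8] * 4 + [0.2] * 2

naive = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))  # 4/6 ≈ 0.667
ipw   = weighted_accuracy(y_true, y_pred, p)                      # 8.75/15 ≈ 0.583
```

The naive estimate overstates performance because the harder, undersampled group contributes too few examples; reweighting shifts the estimate toward the population value. In practice the selection probabilities are not observed and must come from the explicit assumptions about the bias mechanism that the abstract calls for.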