Variance in predictions across different trained models is a significant, under-explored source of error in fair binary classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We: 1) define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) conduct the largest to-date empirical study of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair binary classification; and 4) release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions on benchmark datasets. Most fair binary classification benchmarks are close to fair once the amount of arbitrariness present in predictions is taken into account, before we even try to apply any fairness interventions. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should reconsider how we choose to measure fairness in binary classification.
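To make the first two contributions concrete, the following is a minimal sketch and not the paper's implementation. It assumes that self-consistency for a single example can be estimated as the fraction of pairs of independently trained models (for instance, models fit on different bootstrap resamples) that agree on that example, and that the ensemble abstains when this value falls below a cutoff. The function names and the 0.75 threshold are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def self_consistency(votes: Sequence[int]) -> float:
    """Estimate self-consistency for one example.

    `votes` holds the 0/1 predictions of B separately trained models.
    Assumption: self-consistency is taken to be the fraction of
    unordered model pairs that agree, i.e. 1 - 2*B0*B1 / (B*(B-1)).
    """
    b = len(votes)
    if b < 2:
        raise ValueError("need predictions from at least two models")
    b1 = sum(votes)  # number of models predicting class 1
    b0 = b - b1      # number of models predicting class 0
    return 1.0 - (2.0 * b0 * b1) / (b * (b - 1))


def predict_or_abstain(
    votes: Sequence[int], threshold: float = 0.75
) -> Optional[int]:
    """Return the majority vote, or None to abstain.

    The threshold is a hypothetical cutoff: examples whose
    self-consistency falls below it are treated as arbitrary.
    An exact tie that passes the threshold resolves to class 0.
    """
    if self_consistency(votes) < threshold:
        return None
    return int(2 * sum(votes) > len(votes))


if __name__ == "__main__":
    unanimous = [1] * 10
    mostly_one = [1] * 9 + [0]
    split = [1] * 5 + [0] * 5
    for name, votes in [
        ("unanimous", unanimous),
        ("mostly_one", mostly_one),
        ("split", split),
    ]:
        print(name, self_consistency(votes), predict_or_abstain(votes))
```

Under this estimator, ten models that all agree give a self-consistency of 1, while an even five-to-five split gives about 0.44, so the split example is the one the sketch declines to classify. Raising the threshold trades coverage for fewer arbitrary decisions.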