Variance in predictions across different trained models is a significant, under-explored source of error in fair classification. In practice, the variance on some data examples is so large that decisions can be effectively arbitrary. To investigate this problem, we take an experimental approach and make four overarching contributions. We 1) define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; 2) develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; 3) conduct the largest to-date empirical study of the role of variance (vis-à-vis self-consistency and arbitrariness) in fair classification; and 4) release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily usable for future research. Altogether, our experiments reveal shocking insights about the reliability of conclusions drawn on benchmark datasets. Most fair classification benchmarks are close to fair once the arbitrariness present in predictions is taken into account, before any common fairness intervention is applied. This finding calls into question the practical utility of common algorithmic fairness methods, and in turn suggests that we should fundamentally reconsider how we choose to measure fairness in machine learning.
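To make the first two contributions concrete, the following is a minimal sketch, not the implementation from this work. It assumes binary labels and B models trained on different resamples of the data, and it assumes self-consistency can be computed as the fraction of model pairs that agree on an example, 1 - 2*B0*B1 / (B*(B-1)), where B0 and B1 count the models predicting each class. The names `self_consistency`, `abstaining_vote`, `ABSTAIN`, and the threshold `kappa` are hypothetical and chosen only for illustration.

```python
import numpy as np

ABSTAIN = -1  # hypothetical sentinel meaning "no prediction returned"


def self_consistency(preds):
    """Pairwise agreement rate per example (an assumed formulation).

    preds: array of shape (B, n) holding 0/1 predictions from B models
    on n examples. Returns an array of shape (n,) with values in [0, 1].
    """
    preds = np.asarray(preds)
    b = preds.shape[0]
    if b < 2:
        raise ValueError("need at least two models to measure agreement")
    ones = preds.sum(axis=0)   # B1: models predicting class 1
    zeros = b - ones           # B0: models predicting class 0
    # Fraction of ordered model pairs that disagree is 2*B0*B1 / (B*(B-1)).
    return 1.0 - 2.0 * zeros * ones / (b * (b - 1))


def abstaining_vote(preds, kappa=0.75):
    """Majority vote that abstains when self-consistency is below kappa."""
    preds = np.asarray(preds)
    sc = self_consistency(preds)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return np.where(sc >= kappa, majority, ABSTAIN)


# Toy predictions: 4 models (rows) on 3 examples (columns).
preds = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
])
sc = self_consistency(preds)        # approximately [1.0, 0.333, 1.0]
decisions = abstaining_vote(preds)  # [1, -1, 0]
```

In this toy case the models split evenly on the second example, so its self-consistency is low and the ensemble abstains rather than return what would amount to a coin flip. The threshold value 0.75 is arbitrary here; how it should be set is not specified in the text above.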