In fair classification, it is common to train a model and then compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, these decisions are not necessarily equally confident. This becomes clear when we measure variance: we can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret greater disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate that it often improves both fairness and accuracy. Further, our evaluation reveals a startling observation: applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.
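
The variance measurement described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`fit_nearest_mean`, `bootstrap_votes`, `pairwise_agreement`, `majority_vote`) are hypothetical, the one-dimensional nearest-mean learner is a toy stand-in for a real model, and the agreement score (the fraction of pairs of distinct models that agree on a test example) is one plausible estimator rather than the paper's formal definition of self-consistency. The majority vote at the end is plain bagging, shown only to convey the idea of ensembling; it is not the paper's algorithm.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature value, binary label); hypothetical toy format


def fit_nearest_mean(train: Sequence[Example]) -> Callable[[float], int]:
    """Toy learner (an assumption): predict the class whose mean feature is closer."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    if not xs0 or not xs1:
        # Degenerate resample with one class only: always predict that class.
        only = 1 if xs1 else 0
        return lambda x: only
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    return lambda x: int(abs(x - m1) < abs(x - m0))


def bootstrap_votes(
    train: Sequence[Example], test_xs: Sequence[float], n_models: int, seed: int = 0
) -> List[List[int]]:
    """Train n_models on bootstrap resamples; only the training subset varies.

    Returns votes[i][b]: the prediction of model b on test example i.
    """
    rng = random.Random(seed)
    votes: List[List[int]] = [[] for _ in test_xs]
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in range(len(train))]
        model = fit_nearest_mean(sample)
        for i, x in enumerate(test_xs):
            votes[i].append(model(x))
    return votes


def pairwise_agreement(votes: Sequence[int]) -> float:
    """Fraction of pairs of distinct models that agree on one test example."""
    b = len(votes)
    if b < 2:
        raise ValueError("need at least two models to measure agreement")
    b1 = sum(votes)
    b0 = b - b1
    # b0 * b1 pairs disagree, out of b * (b - 1) / 2 pairs in total.
    return 1.0 - 2.0 * b0 * b1 / (b * (b - 1))


def majority_vote(votes: Sequence[int]) -> int:
    """Plain bagging prediction; ties go to class 0."""
    return int(2 * sum(votes) > len(votes))


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Two overlapping classes, so test points near the boundary are contested.
    train = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(15)]
    train += [(data_rng.gauss(1.0, 1.0), 1) for _ in range(15)]
    test_xs = [-3.0, 0.5, 4.0]
    for x, v in zip(test_xs, bootstrap_votes(train, test_xs, n_models=101)):
        print(x, round(pairwise_agreement(v), 3), majority_vote(v))
```

In this sketch, a score of 1.0 means every resampled model made the same decision for that example, while a score near 0.5 means the decision behaves like a coin flip across training subsets, which corresponds to the "effectively arbitrary" case described above.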