In fair classification, it is common to train a model, compare subgroup-specific error rates, and correct for disparities. However, even if a model's classification decisions satisfy a fairness metric, these decisions are not necessarily equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret greater disagreement to mean that the learning process is less stable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate that it often improves both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can undermine the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.
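The variance measurement described above (vary only the training subset, train several models, compare their predictions per test example) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy 1-D threshold learner, the hypothetical `TRAIN` and `TEST_XS` data, the function names, and the choice to score agreement as the fraction of distinct model pairs that give the same binary prediction.

```python
import random
from statistics import mean

# Hypothetical toy data: (feature, label) pairs with overlapping classes.
TRAIN = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.6, 0),
         (2.4, 1), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1)]
TEST_XS = [0.0, 2.5, 5.0]


def fit_threshold(sample):
    """Toy learner: place a threshold halfway between the two class means."""
    mean0 = mean([x for x, y in sample if y == 0])
    mean1 = mean([x for x, y in sample if y == 1])
    return (mean0 + mean1) / 2.0


def bootstrap_thresholds(train, num_models, seed=0):
    """Train num_models toy models, varying only the bootstrap resample."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < num_models:
        sample = [rng.choice(train) for _ in train]
        # The toy learner needs both classes present in the resample.
        if {y for _, y in sample} == {0, 1}:
            thresholds.append(fit_threshold(sample))
    return thresholds


def pairwise_agreement(predictions):
    """Fraction of distinct model pairs with the same binary prediction."""
    b = len(predictions)
    ones = sum(predictions)
    zeros = b - ones
    # zeros * ones pairs disagree, out of b * (b - 1) / 2 pairs in total.
    return 1.0 - (2.0 * zeros * ones) / (b * (b - 1))


def agreement_per_example(thresholds, test_xs):
    """Score each test example by how often the trained models agree on it."""
    return [pairwise_agreement([int(x > t) for t in thresholds])
            for x in test_xs]


if __name__ == "__main__":
    models = bootstrap_thresholds(TRAIN, num_models=50)
    for x, score in zip(TEST_XS, agreement_per_example(models, TEST_XS)):
        print(f"x={x}: agreement={score:.2f}")
```

In this sketch, test points far from the class boundary receive the same prediction from every resampled model, while a point near the boundary can flip between models. Low agreement on such a point is the kind of instability the text calls effectively arbitrary.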