How Far Can Fairness Constraints Help Recover From Biased Data?

A common belief in fair classification is that fairness constraints incur a trade-off with accuracy, which biased data may worsen. Contrary to this belief, Blum & Stangl (2019) show that, even on extremely biased data, fair classification with equal opportunity constraints can recover optimally accurate and fair classifiers on the original data distribution. Their result is interesting because it demonstrates that fairness constraints can implicitly rectify data bias and simultaneously overcome a perceived fairness-accuracy trade-off. Their data bias model simulates under-representation and label bias in an underprivileged population, and they show the above result on a stylized data distribution with i.i.d. label noise, under simple conditions on the data distribution and bias parameters. We propose a general approach to extend the result of Blum & Stangl (2019) to different fairness constraints, data bias models, data distributions, and hypothesis classes. We strengthen their result and extend it to the case where their stylized distribution has labels with Massart noise instead of i.i.d. noise. We prove a similar recovery result for arbitrary data distributions using fair reject option classifiers. We further generalize it to arbitrary data distributions and arbitrary hypothesis classes: we prove that, for any data distribution, if the optimally accurate classifier in a given hypothesis class is fair and robust, then it can be recovered through fair classification with equal opportunity constraints on the biased distribution whenever the bias parameters satisfy certain simple conditions. Finally, we show applications of our technique to time-varying data bias in classification and to fair machine learning pipelines.
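To make the setting concrete, the following is a minimal sketch of a bias model with under-representation and label bias, together with the quantity that an equal opportunity constraint controls (the gap in true positive rates between groups). The parameterization is an assumption for illustration and is not specified in the abstract: `beta` is taken to be the probability of retaining a positive example from the underprivileged group, and `nu` the probability of flipping a positive label in that group to negative. The function names `apply_bias` and `tpr_gap` are hypothetical.

```python
import numpy as np

def apply_bias(group, y, beta, nu, rng):
    """Illustrative bias model (assumed parameterization, not from the abstract).

    For positive examples in the underprivileged group (group == 1):
      - under-representation: each is kept with probability beta;
      - label bias: each has its label flipped to 0 with probability nu.
    Returns a boolean keep-mask and the biased label vector.
    """
    group = np.asarray(group)
    y = np.asarray(y)
    n = y.shape[0]
    target = (group == 1) & (y == 1)
    # Examples outside the targeted subgroup are always kept.
    keep = ~target | (rng.random(n) < beta)
    # Only targeted positives can have their label flipped.
    flip = target & (rng.random(n) < nu)
    y_biased = np.where(flip, 0, y)
    return keep, y_biased

def tpr_gap(y_true, y_pred, group):
    """Absolute difference in true positive rate between groups 0 and 1.

    Equal opportunity asks for this gap to be zero. Each group is assumed
    to contain at least one positive example.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = []
    for g in (0, 1):
        positives = (group == g) & (y_true == 1)
        tprs.append(y_pred[positives].mean())
    return abs(tprs[0] - tprs[1])
```

A biased training sample would then be obtained as `group[keep]`, `y_biased[keep]` (and the corresponding feature rows), while `tpr_gap` can be evaluated on either the biased or the original labels to compare how a classifier fares under each.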
