FairBalance: How to Achieve Equalized Odds With Data Pre-processing

This research seeks to benefit the software engineering community by providing a simple yet effective pre-processing approach to achieve equalized odds fairness in machine learning software. Fairness issues have attracted increasing attention as machine learning software is increasingly used for high-stakes and high-risk decisions. Among the existing fairness notions, this work specifically targets "equalized odds" because it always admits a perfect classifier. Equalized odds requires that no demographic group receives disparate mistreatment, that is, the true positive rate and the false positive rate are the same across groups. Prior works either optimize for an equalized odds related metric during the learning process as a black box, or manipulate the training data following some intuition. This work studies the root cause of the violation of equalized odds and how to tackle it. We find that equalizing the class distribution in each demographic group with sample weights is a necessary condition for achieving equalized odds without modifying the normal training process. In addition, an important partial condition for equalized odds (zero average odds difference) can be guaranteed when the class distributions are weighted to be not only equal but also balanced (1:1). Based on these analyses, we propose FairBalance, a pre-processing algorithm that balances the class distribution in each demographic group by assigning calculated weights to the training data. On eight real-world datasets, our empirical results show that, at low computational overhead, FairBalance significantly improves equalized odds with little, if any, damage to utility. FairBalance also outperforms existing state-of-the-art approaches in terms of equalized odds. To facilitate reuse, reproduction, and validation, we have made our scripts available at https://github.com/hil-se/FairBalance.
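
The weighting idea described above can be illustrated with a minimal sketch. It is not the released implementation (the repository linked above is the authoritative source). The function names `balance_weights` and `weighted_class_totals`, the toy data, and the specific weight formula (group size divided by the size of the group-and-class cell) are assumptions chosen only so that every class inside a demographic group ends up with the same total weight, i.e. a 1:1 weighted class ratio.

```python
from collections import Counter


def balance_weights(groups, labels):
    """Return one weight per sample so that, inside every demographic group,
    each class carries the same total weight (a 1:1 weighted class ratio).

    Assumption (illustrative, not taken from the paper's code):
    weight = (number of samples in the group)
             / (number of samples in the group with this label).
    """
    if len(groups) != len(labels):
        raise ValueError("groups and labels must have the same length")
    group_sizes = Counter(groups)              # samples per demographic group
    cell_sizes = Counter(zip(groups, labels))  # samples per (group, class) cell
    return [group_sizes[g] / cell_sizes[(g, y)] for g, y in zip(groups, labels)]


def weighted_class_totals(groups, labels, weights):
    """Sum the weights in each (group, class) cell to check the balance."""
    totals = Counter()
    for g, y, w in zip(groups, labels, weights):
        totals[(g, y)] += w
    return dict(totals)


# Hypothetical toy data: group "a" is mostly positive, group "b" mostly negative.
groups = ["a", "a", "a", "a", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0]

weights = balance_weights(groups, labels)
totals = weighted_class_totals(groups, labels, weights)
print(totals)
```

In this toy example the positive and negative classes of each group receive equal total weight after reweighting, while the unweighted class ratios are 3:1 and 1:2. Weights of this kind could then be passed to a learner that accepts per-sample weights, for example through the `sample_weight` argument of a scikit-learn estimator's `fit` method, leaving the training procedure itself unchanged.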
