A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers

Software bias is an increasingly important operational concern for software engineers. We present a large-scale, comprehensive empirical study of 17 representative bias mitigation methods for Machine Learning (ML) classifiers, evaluated with 11 ML performance metrics (e.g., accuracy), 4 fairness metrics, and 20 types of fairness-performance trade-off assessment, applied to 8 widely adopted software decision tasks. Compared with previous work on this important software property, our study covers the largest numbers of bias mitigation methods, evaluation metrics, and fairness-performance trade-off measures. We find that (1) the bias mitigation methods significantly decrease ML performance in 53% of the studied scenarios (ranging from 42% to 66% depending on the ML performance metric); (2) the bias mitigation methods significantly improve fairness, as measured by the 4 metrics used, in 46% of all the scenarios (ranging from 24% to 59% depending on the fairness metric); (3) the bias mitigation methods even decrease both fairness and ML performance in 25% of the scenarios; (4) the effectiveness of the bias mitigation methods depends on the task, the model, the choice of protected attributes, and the set of metrics used to assess fairness and ML performance; (5) no bias mitigation method achieves the best trade-off in all the scenarios. The best method that we find outperforms the other methods in only 30% of the scenarios. Researchers and practitioners therefore need to choose the bias mitigation method best suited to their intended application scenario(s).
