Machine learning models are increasingly used in critical decision-making applications. However, these models can replicate or even amplify biases present in real-world data. Although the literature offers many bias mitigation methods and base estimators, selecting the best model for a specific application remains challenging. This paper focuses on binary classification and proposes FairGridSearch, a novel framework for comparing fairness-enhancing models. FairGridSearch enables experimentation with different model parameter combinations and recommends the best one. The study applies FairGridSearch to three popular datasets (Adult, COMPAS, and German Credit) and analyzes the impact of metric selection, base estimator choice, and classification threshold on model fairness. The results highlight the importance of selecting appropriate accuracy and fairness metrics for model evaluation. In addition, the choice of base estimator affects the effectiveness of bias mitigation methods, and the classification threshold affects fairness stability, but these effects are not consistent across all datasets. Based on these findings, future research on fairness in machine learning should consider a broader range of factors when building fair models, going beyond bias mitigation methods alone.
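To make the idea of searching over model configurations concrete, the following is a minimal sketch of a fairness-aware grid search. It is not the FairGridSearch implementation. It assumes a toy synthetic dataset, two hypothetical scoring functions standing in for fitted base estimators, statistical parity difference as the fairness metric, and an assumed selection rule that minimizes (1 - accuracy) + |SPD|. The abstract does not specify how the framework ranks candidates, so every name and the cost function here are illustrative only.

```python
import random
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def statistical_parity_difference(y_pred, group):
    """P(pred = 1 | group = 0) - P(pred = 1 | group = 1)."""
    rate = {}
    for g in (0, 1):
        preds = [p for p, s in zip(y_pred, group) if s == g]
        rate[g] = sum(preds) / len(preds)
    return rate[0] - rate[1]


def fair_grid_search(scorers, thresholds, X, y, group):
    """Evaluate every (scorer, threshold) pair and return the lowest-cost one.

    The cost (1 - accuracy) + |SPD| is an assumption made for this sketch.
    """
    results = []
    for (name, scorer), thr in product(scorers.items(), thresholds):
        y_pred = [int(scorer(x) >= thr) for x in X]
        acc = accuracy(y, y_pred)
        spd = statistical_parity_difference(y_pred, group)
        results.append({
            "model": name,
            "threshold": thr,
            "accuracy": acc,
            "spd": spd,
            "cost": (1 - acc) + abs(spd),
        })
    return min(results, key=lambda r: r["cost"]), results


# Toy synthetic data: one numeric feature plus a binary protected attribute
# that also shifts the label, so the data itself carries a group bias.
random.seed(0)
X, y, group = [], [], []
for i in range(200):
    s = i % 2
    x1 = random.random()
    X.append((x1, s))
    group.append(s)
    y.append(int(x1 + 0.2 * s + random.gauss(0, 0.1) > 0.6))

# Hypothetical stand-ins for fitted base estimators (not real models).
scorers = {
    "uses_group": lambda x: 0.8 * x[0] + 0.2 * x[1],
    "ignores_group": lambda x: x[0],
}
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]

best, results = fair_grid_search(scorers, thresholds, X, y, group)
print(best)
```

Swapping the cost function for a different accuracy or fairness metric can change which configuration is selected, which is the kind of sensitivity to metric choice that the abstract reports.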