With fairness concerns gaining significant attention in Machine Learning (ML), several bias mitigation techniques have been proposed and are often compared against each other to find the best method. These benchmarking efforts tend to use a common setup for evaluation, under the assumption that providing a uniform environment ensures a fair comparison. However, bias mitigation techniques are sensitive to hyperparameter choices, random seeds, feature selection, and similar factors, so a comparison on just one setting can unfairly favour certain algorithms. In this work, we show significant variance in the fairness achieved by several algorithms, and the influence of the learning pipeline on fairness scores. We highlight that most bias mitigation techniques can achieve comparable performance when given the freedom to perform hyperparameter optimization, suggesting that the choice of evaluation parameters, rather than the mitigation technique itself, can sometimes create the perceived superiority of one method over another. We hope our work encourages future research on how various choices in the lifecycle of developing an algorithm impact fairness, and on trends that can guide the selection of appropriate algorithms.
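The seed sensitivity described above can be made concrete with a minimal, self-contained sketch (all data and names here are synthetic and hypothetical, not from the paper's experiments): a toy threshold classifier is "trained" on a seed-dependent subsample, and the resulting demographic parity difference is measured across many seeds to expose the spread.

```python
import random

def demographic_parity_diff(preds, groups):
    """Absolute difference in positive-prediction rates between two groups."""
    rates = []
    for g in (0, 1):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates.append(sum(preds[i] for i in idx) / len(idx))
    return abs(rates[0] - rates[1])

def run_trial(seed, n=400):
    rng = random.Random(seed)
    groups, feats, labels = [], [], []
    for _ in range(n):
        g = rng.randint(0, 1)
        # Synthetic data: the feature distribution shifts slightly per group,
        # so any learned threshold induces some group disparity.
        x = rng.gauss(0.3 * g, 1.0)
        y = int(x + rng.gauss(0, 0.5) > 0.2)
        groups.append(g)
        feats.append(x)
        labels.append(y)
    # "Training": pick the threshold (over a coarse grid) that maximizes
    # accuracy on a random subsample -- the subsample depends on the seed,
    # so the learned model (and its fairness score) varies with the seed.
    sample = rng.sample(range(n), n // 4)
    best_t = max((t / 10 for t in range(-10, 11)),
                 key=lambda t: sum((feats[i] > t) == labels[i] for i in sample))
    preds = [int(x > best_t) for x in feats]
    return demographic_parity_diff(preds, groups)

diffs = [run_trial(seed) for seed in range(20)]
print(f"DP difference across 20 seeds: "
      f"min={min(diffs):.3f} max={max(diffs):.3f} "
      f"spread={max(diffs) - min(diffs):.3f}")
```

Reporting only one of these twenty numbers, as a single-seed benchmark implicitly does, can make the same method look either fair or unfair; the spread across seeds is the quantity the abstract argues should be examined.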