With the introduction of machine learning in high-stakes decision making, ensuring algorithmic fairness has become an increasingly important problem. In response, many mathematical definitions of fairness have been proposed, and a variety of optimisation techniques have been developed, each designed to maximise a given notion of fairness. However, fair solutions depend on the quality of the training data and can be highly sensitive to noise. Recent studies have shown that robustness (the ability of a model to perform well on unseen data) plays a significant role in the choice of strategy when approaching a new problem; hence, measuring the robustness of these strategies has become a fundamental problem. In this work, we therefore propose a new criterion for measuring the robustness of fairness optimisation strategies: the robustness ratio. We conduct extensive experiments on five benchmark fairness data sets, using three of the most popular fairness strategies with respect to four of the most popular definitions of fairness. Our experiments show empirically that fairness methods relying on threshold optimisation are very sensitive to noise on all the evaluated data sets, despite mostly outperforming the other methods. In contrast, the other two methods are less fair in low-noise scenarios but fairer in high-noise ones. To the best of our knowledge, we are the first to quantitatively evaluate the robustness of fairness optimisation strategies. This evaluation can potentially serve as a guideline for choosing the most suitable fairness strategy for various data sets.