This research seeks to benefit the software engineering community by proposing comparative separation, a novel group fairness notion for evaluating the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted increasing attention as machine learning software is increasingly used for high-stakes, high-risk decisions. All software developers share the responsibility of making their software accountable by ensuring that machine learning software does not perform differently across sensitive groups -- that is, by satisfying the separation criterion. However, evaluating separation requires a ground truth label for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide a rating or categorical label for each test data point, comparative judgments are made between pairs of data points, such as "A is better than B." According to the law of comparative judgment, providing such comparative judgments imposes a lower cognitive burden on humans than providing ratings or categorical labels. This work first defines comparative separation, a novel fairness notion on comparative judgment test data, along with metrics to evaluate it. Then, both theoretically and empirically, we show that comparative separation is equivalent to separation in binary classification problems. Lastly, we analyze the numbers of test data points and test data pairs required to achieve the same level of statistical power when evaluating separation and comparative separation, respectively. This work is the first to explore fairness evaluation on comparative judgment test data, and it demonstrates the feasibility and practical benefits of using comparative judgment test data for model evaluation.
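To make the two evaluation settings concrete, the following is a minimal toy sketch (not the paper's implementation, and all data are synthetic): separation is evaluated from labeled test points via per-group true/false positive rates, while a comparative-judgment-style evaluation only uses which of two points is "better," here approximated by the per-group fraction of (positive, negative) pairs that the model's scores rank correctly.

```python
import random

random.seed(0)

# Hypothetical test set: (sensitive_group, true_label, model_score).
# Scores overlap across labels so the classifier is imperfect.
data = [(g, y, 0.4 * y + 0.6 * random.random())
        for g in ("a", "b") for y in (0, 1) for _ in range(50)]

def group_rates(points, threshold=0.5):
    """Per-group (TPR, FPR) -- the quantities equalized by separation.

    Requires a ground truth label for every test data point.
    """
    rates = {}
    for grp in {g for g, _, _ in points}:
        tp = sum(1 for g, y, s in points if g == grp and y == 1 and s >= threshold)
        pos = sum(1 for g, y, _ in points if g == grp and y == 1)
        fp = sum(1 for g, y, s in points if g == grp and y == 0 and s >= threshold)
        neg = sum(1 for g, y, _ in points if g == grp and y == 0)
        rates[grp] = (tp / pos, fp / neg)
    return rates

def pairwise_concordance(points):
    """Per-group fraction of (positive, negative) pairs ranked correctly.

    A comparative judgment only says which point of a pair is better,
    so no per-point label is needed at evaluation time.
    """
    out = {}
    for grp in {g for g, _, _ in points}:
        pos = [s for g, y, s in points if g == grp and y == 1]
        neg = [s for g, y, s in points if g == grp and y == 0]
        correct = sum(1 for p in pos for n in neg if p > n)
        out[grp] = correct / (len(pos) * len(neg))
    return out

print("per-group (TPR, FPR):", group_rates(data))
print("per-group pair concordance:", pairwise_concordance(data))
```

Comparing the per-group rates (or concordances) across groups then gives a simple disparity measure; the specific comparative separation metrics are defined in the body of the paper.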