Background: Fairness testing for deep learning systems has become increasingly important. However, much existing work assumes ideal context and conditions elsewhere in the system: hyperparameters well tuned for accuracy, bias in the data rectified, and bias in the labeling mitigated. Yet these conditions are often difficult to achieve in practice because they are resource- and labour-intensive. Aims: In this paper, we aim to understand how varying contexts affect fairness testing outcomes. Method: We conduct an extensive empirical study covering $10,800$ cases to investigate how contexts can change the fairness testing result at the model level, compared with the existing assumptions. We also study why these outcomes arise through the lens of correlation and fitness landscape analysis. Results: Our results show that different context types and settings generally have a significant impact on the testing, mainly caused by shifts in the fitness landscape under varying contexts. Conclusions: Our findings provide key insights for practitioners evaluating test generators and hint at future research directions.