A Large-scale Empirical Study on Improving the Fairness of Deep Learning Models

Fairness is a critical issue affecting the adoption of deep learning models in practice. Many methods have been proposed to improve model fairness, and each has been shown to be effective in its own context. However, no systematic evaluation compares them under a common setting, which makes it hard to understand how their performance differs and hinders both research progress and practical adoption. To fill this gap, this paper presents the first large-scale empirical study that comprehensively compares the performance of existing state-of-the-art fairness-improving techniques. Specifically, we target the widely used application scenario of image classification and use three datasets and five commonly used performance metrics to assess a total of 13 methods from diverse categories. Our findings reveal substantial variation in the performance of each method across datasets and sensitive attributes, indicating that many existing methods over-fit to specific datasets. Furthermore, different fairness evaluation metrics, owing to their distinct focuses, yield significantly different assessment results. Overall, we observe that pre-processing and in-processing methods outperform post-processing methods, with pre-processing methods performing best. Our study examines the problem from multiple dimensions and offers comprehensive recommendations for enhancing fairness in deep learning models, aiming to provide a uniform evaluation platform and, through a set of implications, to inspire researchers to explore more effective fairness solutions.
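
To make the point about diverging metrics concrete, the following minimal sketch computes two fairness metrics on the same toy predictions. The abstract does not name the five metrics used in the study, so the two shown here (demographic parity difference and equal opportunity difference) are assumptions chosen because they are common in the fairness literature. The data and all function names are hypothetical and purely illustrative.

```python
# Illustrative sketch only: the metrics and toy data below are assumptions,
# not the metrics or datasets used in the study.

def positive_rate(preds, mask):
    """Fraction of positive predictions among the samples selected by mask."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)

def demographic_parity_diff(y_pred, group):
    """|P(y_hat=1 | group=0) - P(y_hat=1 | group=1)|."""
    rate0 = positive_rate(y_pred, [g == 0 for g in group])
    rate1 = positive_rate(y_pred, [g == 1 for g in group])
    return abs(rate0 - rate1)

def equal_opportunity_diff(y_true, y_pred, group):
    """Absolute gap in true positive rate between the two groups."""
    tpr0 = positive_rate(y_pred, [g == 0 and t == 1 for g, t in zip(group, y_true)])
    tpr1 = positive_rate(y_pred, [g == 1 and t == 1 for g, t in zip(group, y_true)])
    return abs(tpr0 - tpr1)

# Toy binary labels, predictions, and a binary sensitive attribute.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_diff(y_pred, group)         # 0.0
eo_gap = equal_opportunity_diff(y_true, y_pred, group)  # 0.5
print(f"demographic parity gap: {dp_gap}, equal opportunity gap: {eo_gap}")
```

On this toy data both groups receive positive predictions at the same rate, so the demographic parity gap is zero, yet the true positive rates differ by 0.5. The same classifier therefore looks fair under one metric and unfair under the other, which is the kind of disagreement the findings describe.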
