Machine learning algorithms permeate the day-to-day aspects of our lives, so studying the fairness of these algorithms before implementation is crucial. One way in which bias can manifest in a dataset is through missing values. Missing data are often assumed to be missing completely at random; in reality, the propensity of data to be missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values, and the handling thereof, can impact the fairness of an algorithm. Most researchers either apply listwise deletion or tend to use simpler imputation methods (e.g., mean or mode imputation) rather than more advanced approaches (e.g., multiple imputation). This study considers the fairness of various classification algorithms after a range of missing data handling strategies is applied. Missing values are generated (i.e., amputed) in three popular classification-fairness datasets, creating a high percentage of missing values under each of three missing data mechanisms. The results show that the missing data mechanism does not significantly impact fairness; among the missing data handling techniques, listwise deletion gives the highest fairness on average, and among the classification algorithms, random forests give the highest fairness on average. The interaction effect between the missing data handling technique and the classification algorithm is also often significant.
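As a minimal illustration of the ideas above, the sketch below amputes a toy dataset under a MAR-style mechanism in which the probability of a value being missing depends on a demographic attribute, then applies two of the handling strategies discussed: listwise deletion and mean imputation. The dataset, column names, and missingness rates are hypothetical and are not taken from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Toy dataset: a binary demographic attribute and one numeric feature
# (both hypothetical, for illustration only).
df = pd.DataFrame({
    "group": rng.integers(0, 2, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

# Amputation under a MAR-style mechanism: the probability that "income"
# is missing depends on the observed demographic attribute.
p_missing = np.where(df["group"] == 1, 0.5, 0.1)
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# Strategy 1: listwise deletion -- drop every row with a missing value.
deleted = df.dropna()

# Strategy 2: mean imputation -- fill missing values with the column mean.
imputed = df.assign(income=df["income"].fillna(df["income"].mean()))

# Deletion disproportionately removes rows from group 1, shifting the
# group balance seen by a downstream classifier; imputation keeps all
# rows but distorts the feature's distribution within group 1.
print(deleted["group"].mean(), imputed["group"].mean())
```

This contrast is one intuition for why the handling technique can interact with the classification algorithm: each strategy changes the training data in a different, group-dependent way.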