Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which the data is first imputed and the imputed data is then used for classification (a procedure referred to as "impute-then-classify"), can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier on imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputation discards the missing pattern of the data, which often conveys information about the predictive label. We then present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving the information encoded within them. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
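The claim that imputation discards the missing pattern can be illustrated with a small sketch. This is not the adaptive algorithm proposed in the paper, which the abstract does not specify. It is a toy example using the common technique of appending a missing-value indicator, and the data and the helper names (`mean_impute`, `with_missing_indicator`, `labels_determined_by`) are hypothetical.

```python
import numpy as np

# Toy data (hypothetical): one feature, NaN marks a missing entry.
# By construction, the label is 1 exactly when the feature is missing.
x = np.array([1.0, 3.0, np.nan, 2.0, np.nan, 2.0])
y = np.array([0, 0, 1, 0, 1, 0])


def mean_impute(col):
    """Replace NaN entries with the mean of the observed entries."""
    filled = col.copy()
    filled[np.isnan(col)] = np.nanmean(col)
    return filled


def with_missing_indicator(col):
    """Mean-impute, then append a 0/1 column recording which entries were missing."""
    mask = np.isnan(col).astype(float)
    return np.column_stack([mean_impute(col), mask])


def labels_determined_by(features, labels):
    """Return True if no two rows share identical features but different labels."""
    seen = {}
    for row, label in zip(features, labels):
        key = tuple(row)
        if seen.setdefault(key, label) != label:
            return False
    return True


# Impute-then-classify input: the missing rows become indistinguishable
# from observed rows whose value happens to equal the mean.
x_imputed = mean_impute(x).reshape(-1, 1)

# Imputed input that keeps the missing pattern as an extra column.
x_augmented = with_missing_indicator(x)

print(labels_determined_by(x_imputed, y))    # False
print(labels_determined_by(x_augmented, y))  # True
```

In this toy data, no classifier trained on `x_imputed` alone can separate the two classes, because rows with the same imputed value carry different labels. Keeping the indicator column restores the distinction.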