Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which data are first imputed and the imputed data are then used for classification (referred to as "impute-then-classify"), can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier on imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputation discards the missing pattern of the data, which often conveys information about the predictive label. We then present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving the information encoded within them. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
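The claim that imputation loses the missing pattern can be illustrated with a small toy sketch. The code below is not the adaptive algorithm proposed in the paper; it only contrasts plain mean imputation with a common alternative, appending missing-indicator columns, on hypothetical data in which the label is 1 exactly when the feature is missing. The function names `mean_impute` and `with_missing_indicators` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


def with_missing_indicators(X):
    """Mean-impute, then append one 0/1 column per feature marking missing entries."""
    mask = np.isnan(X).astype(float)
    return np.hstack([mean_impute(X), mask])


# Hypothetical toy data: a single feature whose missingness determines the label.
X = np.array([[1.0], [2.0], [3.0], [np.nan], [np.nan], [2.0]])
y = np.array([0, 0, 0, 1, 1, 0])

# Impute-then-classify input: the missing rows are filled with the column mean (2.0),
# so they become identical to the observed row with value 2.0 and label 0.
X_imputed = mean_impute(X)

# Pattern-preserving input: the extra indicator column keeps the missing pattern,
# which in this toy example coincides with the label.
X_augmented = with_missing_indicators(X)

# Rows 3 and 5 have different labels but the same imputed feature value.
collision = (X_imputed[3, 0] == X_imputed[5, 0]) and (y[3] != y[5])
# The indicator column alone separates the two classes here.
indicator_matches_label = np.array_equal(X_augmented[:, 1], y)
```

In this sketch, no classifier trained on `X_imputed` can distinguish row 3 from row 5, while the indicator column in `X_augmented` still separates them. In practice, scikit-learn's `SimpleImputer` offers a similar option through its `add_indicator` parameter; whether that matches the paper's approach is not stated in the abstract.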