Recent years have seen a surge of machine learning approaches aimed at reducing disparities in model outputs across different subgroups. In many settings, training data may be used in multiple downstream applications by different users, which means it may be most effective to intervene on the training data itself. In this work, we present FairWASP, a novel pre-processing approach designed to reduce disparities in classification datasets without modifying the original data. FairWASP returns sample-level weights such that the reweighted dataset minimizes the Wasserstein distance to the original dataset while satisfying (an empirical version of) demographic parity, a popular fairness criterion. We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples. FairWASP can therefore be used to construct datasets which can be fed into any classification method, not just methods which accept sample weights. Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method. Experiments demonstrate that our proposed optimization algorithm significantly outperforms state-of-the-art commercial solvers in solving both the MIP and its linear program relaxation. Further experiments highlight the competitive performance of FairWASP in reducing disparities while preserving accuracy in downstream classification settings.
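To make the reweighting idea in the abstract concrete, the following is a minimal sketch, not the FairWASP algorithm itself. It does not solve the MIP or compute any Wasserstein distance; it only illustrates two claims from the abstract: that sample-level weights can change an empirical demographic parity gap, and that integer weights are equivalent to duplicating or eliminating samples. The function names (`weighted_positive_rates`, `parity_gap`, `materialize`), the toy data, and the hand-picked weights are all hypothetical. The gap is measured here as the largest difference in weighted positive-label rates between groups, which is one common reading of demographic parity and may differ from the exact constraint used in the paper.

```python
import numpy as np


def weighted_positive_rates(y, d, w):
    """Weighted rate of y == 1 within each group of d.

    Assumes every group has strictly positive total weight.
    """
    y, d, w = np.asarray(y), np.asarray(d), np.asarray(w, dtype=float)
    rates = {}
    for g in np.unique(d):
        mask = d == g
        rates[g.item()] = float(np.sum(w[mask] * (y[mask] == 1)) / np.sum(w[mask]))
    return rates


def parity_gap(y, d, w):
    """Largest difference in weighted positive rates across groups."""
    rates = weighted_positive_rates(y, d, w)
    return max(rates.values()) - min(rates.values())


def materialize(X, y, d, w):
    """Turn non-negative integer weights into an explicit dataset.

    Weight 0 drops a row, weight k keeps k copies of it.
    """
    w = np.asarray(w)
    if not np.issubdtype(w.dtype, np.integer) or (w < 0).any():
        raise ValueError("weights must be non-negative integers")
    idx = np.repeat(np.arange(len(w)), w)
    return np.asarray(X)[idx], np.asarray(y)[idx], np.asarray(d)[idx]


# Toy data (hypothetical): 8 samples, binary label y, binary group d.
X = np.arange(8).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 0, 1])
d = np.array([0, 0, 0, 0, 1, 1, 1, 1])

uniform = np.ones(8, dtype=int)
# Hand-picked integer weights for illustration only; FairWASP would
# obtain weights by solving its optimization problem instead.
w = np.array([1, 1, 0, 2, 0, 1, 1, 2])

gap_before = parity_gap(y, d, uniform)  # 0.75 vs 0.25 -> 0.5
gap_after = parity_gap(y, d, w)         # 0.5 vs 0.5 -> 0.0

# The materialized dataset needs no sample weights downstream.
X_new, y_new, d_new = materialize(X, y, d, w)
gap_materialized = parity_gap(y_new, d_new, np.ones(len(y_new)))
```

Because the weights are integers, the materialized dataset has the same parity gap under uniform weights as the original dataset has under `w`, which is why a downstream classifier does not need to accept sample weights.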