Data distillation and coresets have emerged as popular approaches for handling large-scale datasets: they generate a smaller, representative set of samples for use in downstream learning tasks. At the same time, machine learning is increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. Current approaches create fair synthetic representative samples by optimizing local properties relative to the original samples, but their effect on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets (FWC), a novel coreset approach that generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC minimizes the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that FWC: (i) achieves a competitive fairness-performance tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data, and (iii) can be used to reduce biases in predictions from large language models (GPT-3.5 and GPT-4).
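To make the unconstrained case concrete, the following is a minimal sketch of Lloyd's algorithm used as a weighted coreset generator: the cluster centers play the role of synthetic samples, and each weight is the fraction of original points assigned to that center. It is not the FWC method itself, since it enforces no fairness constraint, and it assumes a squared Euclidean cost (the k-means case). The function names `lloyd_coreset` and `demographic_parity_gap` are hypothetical, and the parity measure shown (largest deviation of a group's weighted positive rate from the overall rate) is one common formulation assumed here for illustration.

```python
import numpy as np


def lloyd_coreset(X, m, n_iter=50, seed=0):
    """Return m synthetic points and weights using Lloyd's k-means iterations.

    Illustrative sketch of the unconstrained case only (no fairness constraint).
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the synthetic samples at m distinct original points.
    centers = X[rng.choice(len(X), size=m, replace=False)].copy()

    def assign(c):
        # Nearest synthetic sample under squared Euclidean cost.
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(m):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Weights: share of original points mapped to each synthetic sample.
    labels = assign(centers)
    weights = np.bincount(labels, minlength=m) / len(X)
    return centers, weights


def demographic_parity_gap(y, d, weights=None):
    """Largest |P(y=1 | d=g) - P(y=1)| over groups g, with optional weights.

    Assumed formulation for illustration; y is binary, d is a group label.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    w = np.ones_like(y) if weights is None else np.asarray(weights, dtype=float)
    overall = np.average(y, weights=w)
    gaps = [
        abs(np.average(y[d == g], weights=w[d == g]) - overall)
        for g in np.unique(d)
    ]
    return max(gaps)
```

Calling `lloyd_coreset(X, m)` on an `(n, p)` array returns an `(m, p)` array of synthetic samples and a length-`m` weight vector that sums to one. A fairness-aware method such as FWC would additionally constrain the weighted synthetic samples, for example so that a quantity like `demographic_parity_gap` stays small.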