Data distillation and coresets have emerged as popular approaches to generate a smaller representative set of samples for downstream learning tasks to handle large-scale datasets. At the same time, machine learning is being increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. While current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples, their impact on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets (FWC), a novel coreset approach that generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC uses an efficient majority minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that FWC: (i) achieves a competitive fairness-utility tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data, and (iii) can be used to reduce biases in predictions from large language models (GPT-3.5 and GPT-4).
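To make the stated equivalence concrete, the sketch below implements the two ingredients the abstract names: Lloyd's algorithm (here for k-means only, the squared-Euclidean case), which the abstract says the unconstrained version of FWC reduces to, and a demographic parity gap, the fairness criterion FWC enforces. This is an illustrative sketch, not the authors' FWC implementation; the function names `lloyd_kmeans` and `demographic_parity_gap` and the per-center weighting scheme are assumptions for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm for k-means. Alternates two steps:
    assign each point to its nearest center, then move each
    center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: nearest center under squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # Sample-level weights (assumed scheme): the fraction of original
    # points each synthetic representative stands in for.
    weights = np.bincount(labels, minlength=k) / len(X)
    return centers, weights

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between the
    protected group and its complement; demographic parity asks for
    this gap to be (near) zero."""
    g = np.asarray(group, dtype=bool)
    y = np.asarray(y_pred, dtype=float)
    return abs(y[g].mean() - y[~g].mean())
```

FWC itself replaces the squared-Euclidean clustering objective with a Wasserstein distance between the weighted representatives and the original data, and adds the demographic parity constraint to the optimization; the sketch only shows the unconstrained limiting case the abstract refers to.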