Data distillation and coresets have emerged as popular approaches for handling large-scale datasets by generating a smaller, representative set of samples for downstream learning tasks. At the same time, machine learning is increasingly applied to decision-making processes at a societal level, making it imperative for modelers to address inherent biases towards subgroups present in the data. While current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples, their impact on downstream learning processes has yet to be explored. In this work, we present fair Wasserstein coresets (FWC), a novel coreset approach that generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC uses an efficient majorization-minimization algorithm to minimize the Wasserstein distance between the original dataset and the weighted synthetic samples while enforcing demographic parity. We show that an unconstrained version of FWC is equivalent to Lloyd's algorithm for k-medians and k-means clustering. Experiments conducted on both synthetic and real datasets show that FWC: (i) achieves a competitive fairness-utility tradeoff in downstream models compared to existing approaches, (ii) improves downstream fairness when added to the existing training data, and (iii) can be used to reduce biases in predictions from large language models (GPT-3.5 and GPT-4).
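To make the stated connection to Lloyd's algorithm concrete, the following is a minimal sketch of the unconstrained case only, not of FWC itself. It runs plain k-means and returns the centers as synthetic samples, with each center weighted by the fraction of original points assigned to it; under a squared Euclidean cost, this nearest-center assignment is the transport plan that such a weighted set induces. The function names `lloyd_coreset` and `demographic_parity_gap` are hypothetical, and the parity helper shows one common form of the criterion for a binary group label, which may differ from the constraint used in the paper.

```python
import numpy as np


def lloyd_coreset(X, m, n_iter=50, seed=0):
    """Plain Lloyd's k-means (no fairness constraint).

    Returns m centers and weights, where each weight is the fraction
    of the original points assigned to that center.
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from m distinct data points.
    centers = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest center under squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each non-empty cluster's center to its mean.
        for j in range(m):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment so the weights match the returned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    weights = np.bincount(labels, minlength=m) / len(X)
    return centers, weights


def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between group 1 and group 0."""
    y_pred = np.asarray(y_pred, dtype=float)
    group = np.asarray(group)
    return abs(y_pred[group == 1].mean() - y_pred[group == 0].mean())


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(200, 2))
    centers, weights = lloyd_coreset(X, m=5)
    print(centers.shape, weights.sum())
```

The weights can be passed as per-sample weights to a downstream learner that supports them. A fairness-aware variant would additionally constrain the weights or the samples, which this sketch does not attempt.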