Recent technological advancements have made it possible to collect vast amounts of data, often exceeding the capacity of commonly used machine learning algorithms. Approaches such as coresets and synthetic data distillation have emerged as frameworks for generating a smaller, yet representative, set of samples for downstream training. As machine learning is increasingly applied to decision-making processes, it becomes imperative for modelers to consider and address biases in the data concerning subgroups defined by factors such as race, gender, or other sensitive attributes. Current approaches focus on creating fair synthetic representative samples by optimizing local properties relative to the original samples. These methods, however, are not guaranteed to improve the performance or fairness of downstream learning processes. In this work, we present Fair Wasserstein Coresets (FWC), a novel coreset approach that generates fair synthetic representative samples along with sample-level weights to be used in downstream learning tasks. FWC aims to minimize the Wasserstein distance between the original datasets and the weighted synthetic samples while enforcing (an empirical version of) demographic parity, a prominent criterion for algorithmic fairness, via a linear constraint. We show that FWC can be thought of as a constrained version of Lloyd's algorithm for k-medians or k-means clustering. Our experiments, conducted on both synthetic and real datasets, demonstrate the scalability of our approach and highlight the competitive performance of FWC compared to existing fair clustering approaches, even when attempting to enhance the fairness of the latter through fair pre-processing techniques.
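To make the notion of an empirical demographic parity check on weighted samples concrete, the following is a minimal sketch, not the FWC implementation. It assumes binary outcomes and a single categorical sensitive attribute; the function name `weighted_parity_gap` and the toy data are hypothetical and chosen only for illustration. The sketch measures how far each group's weighted positive rate lies from the overall weighted positive rate, and shows that sample-level weights alone can close that gap.

```python
import numpy as np


def weighted_parity_gap(y, group, weights):
    """Largest absolute gap between any group's weighted positive rate
    and the overall weighted positive rate (0 means parity holds)."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    w = np.asarray(weights, dtype=float)

    # Weighted positive rate over all samples.
    overall = np.average(y, weights=w)

    gaps = []
    for g in np.unique(group):
        mask = group == g
        # Weighted positive rate restricted to group g.
        rate_g = np.average(y[mask], weights=w[mask])
        gaps.append(abs(rate_g - overall))
    return max(gaps)


# Toy data (hypothetical): group "a" has a higher positive rate than "b".
y = np.array([1, 1, 0, 1, 0, 0])
group = np.array(["a", "a", "a", "b", "b", "b"])

uniform = np.ones(6)
# Hand-picked weights that equalize the two groups' weighted positive rates.
reweighted = np.array([1.0, 1.0, 2.0, 2.0, 1.0, 1.0])

gap_uniform = weighted_parity_gap(y, group, uniform)
gap_reweighted = weighted_parity_gap(y, group, reweighted)
print(gap_uniform, gap_reweighted)
```

With uniform weights the two groups' positive rates differ from the overall rate, while the hand-picked weights bring both groups to the same rate. FWC, as described in the abstract, additionally learns the synthetic sample locations and keeps the weighted samples close to the original data in Wasserstein distance, which this sketch does not attempt.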