With the advent of the AI Act and other regulations, there is now an urgent need for algorithms that repair unfairness in training data. In this paper, we define fairness in terms of conditional independence between protected attributes ($S$) and features ($X$), given unprotected attributes ($U$). We address the important setting in which torrents of archival data need to be repaired, using only a small proportion of these data, which are $S|U$-labelled (the research data). We use the latter to design optimal transport (OT)-based repair plans on interpolated supports. This allows {\em off-sample}, labelled, archival data to be repaired, subject to stationarity assumptions. It also significantly reduces the size of the supports of the OT plans, with correspondingly large savings in the cost of their design and of their {\em sequential\/} application to the off-sample data. We provide detailed experimental results with simulated and benchmark real data (the Adult data set). Our performance figures demonstrate effective repair -- in the sense of quenching conditional dependence -- of large quantities of off-sample, labelled (archival) data.
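As a hedged illustration of the design-then-apply workflow described above, the following minimal sketch repairs a single continuous feature within one fixed level of $U$. It is a one-dimensional simplification and not the paper's OT plan design: it relies on the fact that, in one dimension, the optimal transport map is the monotone quantile map and the Wasserstein-2 barycentre has the weighted-average quantile function. The function names `design_repair` and `apply_repair`, the grid size `n_grid`, and the use of group proportions as barycentre weights are all assumptions made for illustration.

```python
import numpy as np


def design_repair(x_research, s_research, n_grid=51):
    """Design a 1-D repair map from the small S-labelled research sample.

    Assumes one fixed level of U and a continuous feature X (hypothetical setup).
    """
    x_research = np.asarray(x_research, dtype=float)
    s_research = np.asarray(s_research)
    levels = np.linspace(0.0, 1.0, n_grid)  # coarse grid of quantile levels
    groups = np.unique(s_research)
    # Empirical quantile function of X for each protected group s.
    quantiles = {s: np.quantile(x_research[s_research == s], levels) for s in groups}
    # Group proportions are used as barycentre weights (an assumption).
    weights = {s: np.mean(s_research == s) for s in groups}
    # In 1-D, the W2 barycentre's quantile function is the weighted average.
    target = sum(weights[s] * quantiles[s] for s in groups)
    return levels, quantiles, target


def apply_repair(x_archive, s_archive, plan):
    """Apply the designed map to off-sample (archival) points, group by group."""
    levels, quantiles, target = plan
    x_archive = np.asarray(x_archive, dtype=float)
    s_archive = np.asarray(s_archive)
    # Points whose group was never seen in the research data stay NaN.
    repaired = np.full(x_archive.shape, np.nan)
    for s, q_s in quantiles.items():
        mask = s_archive == s
        # Interpolated CDF level of each archival point within its own group.
        u = np.interp(x_archive[mask], q_s, levels)
        # Push that level through the barycentre's quantile function.
        repaired[mask] = np.interp(u, levels, target)
    return repaired


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s_res = np.repeat([0, 1], 200)              # small research sample
    x_res = rng.normal(loc=2.0 * s_res, scale=1.0)
    s_arc = np.repeat([0, 1], 5000)             # larger archival sample
    x_arc = rng.normal(loc=2.0 * s_arc, scale=1.0)
    x_rep = apply_repair(x_arc, s_arc, design_repair(x_res, s_res))
    for name, x in (("before", x_arc), ("after", x_rep)):
        print(name, abs(x[s_arc == 1].mean() - x[s_arc == 0].mean()))
```

The map is designed once on `n_grid` support points and can then be applied to archival points one at a time, which mirrors the sequential, off-sample use described in the abstract. Archival values outside the range seen in the research data are clamped to the end points of the grid, so the stationarity assumption matters in practice.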