Optimal transport (OT) plays an important role in transforming data distributions in a manner that engenders fairness. Typically, the OT operators are learnt from unfair, attribute-labelled data and then used to repair those data. Two significant limitations of this approach are as follows: (i) the OT operators for underrepresented subgroups are poorly learnt (i.e.\ they are susceptible to representation bias); and (ii) these OT repairs cannot be effected on identically distributed but out-of-sample (i.e.\ archival) data. In this paper, we address both of these problems by adopting a Bayesian nonparametric stopping rule for learning each attribute-labelled component of the data distribution. The induced OT-optimal quantization operators can then be used to repair the archival data. We formulate a novel definition of the fair distributional target, along with quantifiers that allow us to trade fairness against damage in the transformed data. These are used to demonstrate the excellent performance of our representation-bias-tolerant scheme on simulated and benchmark data sets.
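The repair step alluded to above can be illustrated in its simplest one-dimensional form: each attribute-labelled subgroup is pushed onto a common fair target, here taken to be the 1-D Wasserstein barycenter of the subgroup quantile functions. This is a generic OT-repair sketch, not the paper's Bayesian nonparametric scheme; the function name and grid size are illustrative choices.

```python
import numpy as np

def ot_repair_1d(x, group, weights=None, n_levels=101):
    """Map each group's empirical distribution onto the 1-D Wasserstein
    barycenter of the group quantile functions (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    groups = np.unique(group)
    if weights is None:
        # Default: weight each group by its empirical proportion.
        weights = np.array([np.mean(group == g) for g in groups])
    # Common grid of quantile levels in [0, 1].
    u = np.linspace(0.0, 1.0, n_levels)
    # Each group's quantile function evaluated on the grid.
    q = np.stack([np.quantile(x[group == g], u) for g in groups])
    # Barycenter quantile function: weighted average of group quantiles.
    q_bar = weights @ q
    repaired = np.empty_like(x)
    for g in groups:
        mask = group == g
        # Within-group ranks -> quantile levels for each point.
        ranks = np.argsort(np.argsort(x[mask]))
        levels = (ranks + 0.5) / mask.sum()
        # Transport each point to the barycenter at its quantile level.
        repaired[mask] = np.interp(levels, u, q_bar)
    return repaired
```

After repair, the subgroup distributions (approximately) coincide, so a downstream predictor cannot distinguish groups from this feature; the residual displacement of each point quantifies the "damage" that the quantifiers mentioned above trade off against fairness.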