Data augmentation methods, especially state-of-the-art interpolation-based methods such as Fair Mixup, have been widely shown to increase model fairness. However, this fairness is evaluated with metrics that do not capture model uncertainty and on datasets with only one, relatively large, minority group. Multicalibration has been introduced as a remedy: it measures fairness while accommodating uncertainty and accounting for multiple minority groups. Existing methods for improving multicalibration, however, set aside part of the initial training data as a holdout set for post-processing, which is not ideal when minority training data is already sparse. This paper uses multicalibration to examine data augmentation for classification fairness more rigorously. We stress-test four versions of Fair Mixup on two structured-data classification problems with up to 81 marginalized groups, evaluating multicalibration violations and balanced accuracy. In nearly every experiment, Fair Mixup \textit{worsens} baseline performance and fairness, whereas simple vanilla Mixup \textit{outperforms} both Fair Mixup and the baseline, especially when calibrating on small groups. \textit{Combining} vanilla Mixup with multicalibration post-processing on a holdout set further increases fairness.
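To make the two central quantities concrete, the following is a minimal sketch of vanilla Mixup and of one common way to measure a multicalibration violation. It is illustrative only and is not the implementation evaluated in the paper: the function names (`mixup_batch`, `multicalibration_violation`), the Beta parameter `alpha`, the equal-width binning, and the choice of the worst-case gap over (group, bin) cells are all assumptions made for this example. The exact metric used in the experiments may differ, for instance by weighting cells by their size.

```python
import numpy as np


def mixup_batch(x, y, alpha=1.0, rng=None):
    """Vanilla Mixup: convex combination of a batch with a shuffled copy of itself.

    x: (n, d) feature array; y: (n,) array of labels or label probabilities.
    alpha is the Beta-distribution parameter (hypothetical default).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)      # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))    # random partner for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam


def multicalibration_violation(probs, labels, group_masks, n_bins=10, min_count=1):
    """Worst absolute calibration gap over all (group, prediction-bin) cells.

    probs: (n,) predicted probabilities; labels: (n,) binary outcomes;
    group_masks: iterable of (n,) boolean arrays, one per (possibly
    overlapping) group. Cells with fewer than min_count points are skipped.
    """
    # Assign each prediction to an equal-width bin; 1.0 falls in the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for mask in group_masks:
        for b in range(n_bins):
            cell = mask & (bins == b)
            if cell.sum() < min_count:
                continue
            gap = abs(labels[cell].mean() - probs[cell].mean())
            worst = max(worst, gap)
    return worst
```

Under these assumptions, a model would be trained on the output of `mixup_batch` and then scored with `multicalibration_violation` on held-out predictions, passing one boolean mask per marginalized group. Raising `min_count` avoids reporting gaps driven by a handful of points, which matters most for the small groups the abstract highlights.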