Data augmentation methods, especially state-of-the-art interpolation-based methods such as Fair Mixup, have been widely shown to improve model fairness. However, this fairness is typically evaluated with metrics that do not capture model uncertainty, and on datasets containing only one relatively large minority group. As a remedy, multicalibration has been introduced: a fairness measure that accommodates uncertainty and accounts for multiple minority groups. However, existing methods for improving multicalibration carve a holdout set for post-processing out of the initial training data, which is undesirable when minority training data is already sparse. This paper uses multicalibration to more rigorously examine data augmentation for classification fairness. We stress-test four versions of Fair Mixup on two structured-data classification problems with up to 81 marginalized groups, evaluating multicalibration violations and balanced accuracy. We find that in nearly every experiment, Fair Mixup \textit{worsens} the baseline's performance and fairness, while simple vanilla Mixup \textit{outperforms} both Fair Mixup and the baseline, especially when calibrating on small groups. \textit{Combining} vanilla Mixup with multicalibration post-processing, which enforces multicalibration on a holdout set, further increases fairness.