Ensuring fair predictions across many distinct subpopulations in the training data can be prohibitively costly for large models. Recently, simple linear last-layer retraining strategies, in combination with data augmentation methods such as upweighting, downsampling, and mixup, have been shown to achieve state-of-the-art performance for worst-group accuracy, which quantifies accuracy for the least prevalent subpopulation. For linear last-layer retraining and the aforementioned augmentations, we present the optimal worst-group accuracy when modeling the distribution of the latent representations (input to the last layer) as Gaussian for each subpopulation. We evaluate and verify our results on both synthetic and large publicly available datasets.
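To illustrate the quantity the abstract analyzes, the sketch below computes worst-group accuracy in closed form for a fixed linear classifier when each group's latent features are modeled as Gaussian: for features x ~ N(mu, Sigma) and true label y in {-1, +1}, the accuracy of sign(w·x + b) is Phi(y(w·mu + b) / sqrt(wᵀ Sigma w)), where Phi is the standard normal CDF. This is a minimal sketch with illustrative group parameters, not the paper's derivation of the optimal classifier.

```python
import numpy as np
from math import erf, sqrt

def gaussian_group_accuracy(w, b, mu, Sigma, y):
    """Accuracy of the classifier sign(w.x + b) on a group whose latent
    features are N(mu, Sigma) and whose true label is y in {-1, +1}.
    Closed form: Phi(y * (w.mu + b) / sqrt(w' Sigma w))."""
    margin = y * (w @ mu + b)
    scale = sqrt(w @ Sigma @ w)
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(margin / (scale * sqrt(2.0))))

def worst_group_accuracy(w, b, groups):
    """Minimum per-group accuracy over a list of (mu, Sigma, y) triples."""
    return min(gaussian_group_accuracy(w, b, mu, S, y) for mu, S, y in groups)

# Two hypothetical groups: a majority group far from the decision boundary
# and a minority group closer to it, which dominates the worst-group metric.
w = np.array([1.0, 0.0])
b = 0.0
groups = [
    (np.array([1.0, 0.0]), np.eye(2), +1),   # majority: accuracy Phi(1)
    (np.array([-0.5, 0.0]), np.eye(2), -1),  # minority: accuracy Phi(0.5)
]
print(worst_group_accuracy(w, b, groups))
```

Under this model, retraining the last layer amounts to choosing (w, b) to maximize this minimum, which is the objective the abstract's optimal worst-group accuracy characterizes.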