Improving Evaluation of Debiasing in Image Classification

Image classifiers often rely excessively on peripheral attributes that are strongly correlated with the target class (i.e., dataset bias) when making predictions. Because of this bias, a model correctly classifies samples that contain the bias attributes (i.e., bias-aligned samples) but fails on those that do not (i.e., bias-conflicting samples). Recently, many studies have focused on mitigating such dataset bias, a task referred to as debiasing. However, our comprehensive study indicates that several issues need to be addressed when evaluating debiasing in image classification. First, most previous studies do not specify how they select their hyper-parameters and model checkpoints (i.e., the tuning criterion). Second, debiasing studies to date have evaluated their proposed methods on datasets with excessively high bias severity, and these methods show degraded performance on datasets with low bias severity. Third, debiasing studies do not share consistent experimental settings (e.g., datasets and neural networks), which need to be standardized for fair comparisons. To address these issues, this paper 1) proposes an evaluation metric, the 'Align-Conflict (AC) score', as the tuning criterion, 2) includes experimental settings with low bias severity and shows that they remain underexplored, and 3) standardizes the experimental settings to promote fair comparisons between debiasing methods. We believe that our findings and lessons will inspire future researchers in debiasing to further push state-of-the-art performance with fair comparisons.
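
To make the distinction between bias-aligned and bias-conflicting samples concrete, the following minimal sketch computes accuracy separately on the two groups. The abstract does not state how the AC score combines them, so the harmonic mean used here is an assumption for illustration only, and the names `split_accuracy` and `combined_score` are hypothetical.

```python
from typing import Sequence, Tuple


def split_accuracy(
    preds: Sequence[int],
    targets: Sequence[int],
    is_aligned: Sequence[bool],
) -> Tuple[float, float]:
    """Return (bias-aligned accuracy, bias-conflicting accuracy)."""
    aligned_correct = aligned_total = 0
    conflict_correct = conflict_total = 0
    for pred, target, aligned in zip(preds, targets, is_aligned):
        if aligned:
            aligned_total += 1
            aligned_correct += int(pred == target)
        else:
            conflict_total += 1
            conflict_correct += int(pred == target)
    if aligned_total == 0 or conflict_total == 0:
        raise ValueError("both groups need at least one sample")
    return aligned_correct / aligned_total, conflict_correct / conflict_total


def combined_score(aligned_acc: float, conflict_acc: float) -> float:
    """Harmonic mean of the two accuracies.

    ASSUMPTION: the abstract does not give the AC score formula; the
    harmonic mean is only one plausible way to penalize a large gap.
    """
    if aligned_acc + conflict_acc == 0:
        return 0.0
    return 2 * aligned_acc * conflict_acc / (aligned_acc + conflict_acc)


# Toy validation set: four bias-aligned samples, two bias-conflicting ones.
preds = [0, 1, 1, 0, 1, 1]
targets = [0, 1, 1, 0, 0, 1]
is_aligned = [True, True, True, True, False, False]

aligned_acc, conflict_acc = split_accuracy(preds, targets, is_aligned)
score = combined_score(aligned_acc, conflict_acc)
```

In this toy example the overall accuracy is 5/6, while the bias-conflicting accuracy is only 0.5. A criterion that looks at both groups exposes the gap that overall accuracy hides when bias-aligned samples dominate.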
