Unlike networks trained on a balanced dataset, neural networks trained on an imbalanced dataset are known to have per-class recall (i.e., per-class accuracy) that varies widely from category to category. The convention in long-tailed recognition is to manually split all categories into three subsets ("Many", "Medium", and "Few") and report the average accuracy within each subset. We argue that under such an evaluation setting, some categories are inevitably sacrificed. On one hand, the average accuracy on a balanced test set incurs little penalty even if the worst-performing categories have zero accuracy. On the other hand, classes in the "Few" subset do not necessarily perform worse than those in the "Many" or "Medium" subsets. We therefore advocate focusing on the lowest recall among all categories and on the harmonic mean of all recall values. Specifically, we propose a simple plug-in method that is applicable to a wide range of existing methods. By re-training only the classifier of an existing pre-trained model with our proposed loss function, and optionally applying an ensemble trick that combines the predictions of the original and re-trained classifiers, we achieve a more uniform distribution of recall values across categories. This leads to a higher harmonic mean accuracy while the (arithmetic) average accuracy remains high. The effectiveness of our method is demonstrated on widely used benchmark datasets.
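To make the argument about evaluation metrics concrete, the following is a minimal sketch of how per-class recall, its arithmetic mean, its harmonic mean, and its minimum could be computed. The function names and the toy recall values are illustrative assumptions, not taken from the paper, and the sketch assumes that every class appears at least once in the test set.

```python
def per_class_recall(y_true, y_pred, num_classes):
    """Recall of class c = correct predictions on class c / samples of class c.

    Assumption: every class occurs at least once in y_true.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n for c, n in zip(correct, total)]


def arithmetic_mean(values):
    return sum(values) / len(values)


def harmonic_mean(values):
    """Harmonic mean; defined as 0 when any value is 0 (its limiting value)."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical recall profiles for two models on a 4-class problem.
# Model A sacrifices one class entirely; model B is more uniform.
recalls_a = [0.9, 0.9, 0.9, 0.0]
recalls_b = [0.7, 0.7, 0.7, 0.6]

for name, recalls in [("A", recalls_a), ("B", recalls_b)]:
    print(
        name,
        "arithmetic:", round(arithmetic_mean(recalls), 3),
        "harmonic:", round(harmonic_mean(recalls), 3),
        "lowest:", min(recalls),
    )
```

In this toy setting both models have the same arithmetic mean recall, but only the harmonic mean and the lowest recall reveal that model A has abandoned one category. This is the kind of behavior the abstract argues the conventional evaluation fails to penalize.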