In the real world, data tends to follow long-tailed distributions with respect to class or attribute, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with the promising Vision Transformer (ViT). We find that 1) ViTs are hard to train on long-tailed data, and 2) ViTs learn generalized features when trained in an unsupervised manner, such as masked generative training, on either long-tailed or balanced datasets. Hence, we propose adopting unsupervised learning to make use of long-tailed data. Furthermore, we propose Predictive Distribution Calibration (PDC) as a novel metric for LTR, a setting in which models tend to simply classify inputs into common classes. PDC quantitatively measures how well a model's predictive preferences are calibrated. On this basis, we find that many LTR approaches alleviate this bias toward common classes only slightly, despite improving accuracy. Extensive experiments on benchmark datasets validate that PDC precisely reflects a model's predictive preference, consistent with the visualizations.