Vision-language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, they perform relatively poorly on imbalanced datasets, where the class distribution of the training data is skewed and minority classes are predicted poorly. For instance, CLIP achieves only 5% accuracy on the iNaturalist18 dataset. We propose adding a lightweight decoder to VLMs to avoid the out-of-memory (OOM) problem caused by a large number of classes and to capture nuanced features for tail classes. We then explore improving VLMs through prompt tuning, fine-tuning, and imbalanced learning algorithms such as Focal Loss, Balanced SoftMax, and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when they are combined with the decoder and imbalanced methods. Specifically, our improved VLMs significantly outperform zero-shot classification, by an average accuracy of 6.58%, 69.82%, and 6.17% on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced learning algorithms even for VLMs pre-trained on huge amounts of data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.
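To make two of the imbalanced losses named above concrete, the following is a minimal sketch of Focal Loss and Balanced SoftMax for a single example, written in plain Python. It is not taken from the released code: the function names, the `class_counts` argument, and the toy numbers are all illustrative assumptions, and the formulas follow the commonly cited definitions of these losses rather than any specific implementation in the repository.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    # Standard cross-entropy for one example with an integer label.
    return -log_softmax(logits)[target]

def focal_loss(logits, target, gamma=2.0):
    # Focal Loss: down-weights well-classified examples by (1 - p_t)^gamma.
    # With gamma = 0 it reduces to plain cross-entropy.
    log_pt = log_softmax(logits)[target]
    pt = math.exp(log_pt)
    return -((1.0 - pt) ** gamma) * log_pt

def balanced_softmax_loss(logits, target, class_counts):
    # Balanced SoftMax: add the log class prior to each logit before
    # cross-entropy, so frequent classes need a larger margin during training.
    # class_counts (hypothetical name) holds per-class training-set sizes.
    total = sum(class_counts)
    adjusted = [z + math.log(n / total) for z, n in zip(logits, class_counts)]
    return cross_entropy(adjusted, target)

if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]    # toy, uninformative prediction
    counts = [900, 90, 10]      # toy long-tailed class sizes
    tail = 2                    # index of the rarest class
    print("cross-entropy:   ", cross_entropy(logits, tail))
    print("focal (gamma=2): ", focal_loss(logits, tail))
    print("balanced softmax:", balanced_softmax_loss(logits, tail, counts))
```

In this toy setting the Balanced SoftMax loss on the tail class is much larger than plain cross-entropy, which illustrates how the prior adjustment pushes the model to assign higher raw logits to rare classes.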