Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

Large-scale pre-trained vision-language models have recently shown promise in alleviating class imbalance in long-tailed recognition. However, a long-tailed data distribution can corrupt the representation space: the distance between head and tail categories becomes much larger than the distance between two tail categories. This uneven feature distribution leaves the model with unclear, poorly separated decision boundaries on a uniformly distributed test set, which lowers its performance. To address these challenges, we propose a uniformly distributed category prototype-guided vision-language framework that mitigates the feature space bias caused by data imbalance. Specifically, we generate a set of category prototypes uniformly distributed on a hypersphere. A category prototype-guided mechanism for image-text matching makes the features of different classes converge to these distinct, uniformly distributed prototypes, which keeps the feature space uniformly distributed and sharpens class boundaries. Additionally, our proposed irrelevant text filtering and attribute enhancement module allows the model to ignore irrelevant, noisy text and focus on key attribute information, thereby improving the robustness of the framework. In the image recognition fine-tuning stage, to address the positive bias problem of the learnable classifier, we design a class feature prototype-guided classifier, which compensates for the performance of tail classes while maintaining the performance of head classes. Our method outperforms previous vision-language methods for long-tailed learning by a large margin and achieves state-of-the-art performance.
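
To make the idea of category prototypes uniformly distributed on a hypersphere concrete, the following is a minimal sketch of one common way to obtain such points. The abstract does not say how the prototypes are generated, so this is an illustrative assumption and not the authors' procedure: random unit vectors are pushed apart by projected gradient descent on a log-sum-exp of pairwise cosine similarities. The function name `uniform_prototypes` and the parameters `steps`, `lr`, and `tau` are hypothetical.

```python
import numpy as np


def uniform_prototypes(num_classes, dim, steps=1000, lr=0.1, tau=0.5, seed=0):
    """Spread `num_classes` unit vectors over the unit hypersphere in R^dim.

    Illustrative assumption, not the paper's procedure: minimize
    sum_i tau * log(sum_{j != i} exp(p_i . p_j / tau)) by gradient descent,
    re-projecting every prototype onto the sphere after each step.
    """
    rng = np.random.default_rng(seed)
    p = rng.standard_normal((num_classes, dim))
    p /= np.linalg.norm(p, axis=1, keepdims=True)

    for _ in range(steps):
        logits = (p @ p.T) / tau
        np.fill_diagonal(logits, -np.inf)            # exclude self-similarity
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax

        # Gradient of the loss w.r.t. each prototype: it points toward the
        # prototypes that are currently closest, so we step away from them.
        grad = (weights + weights.T) @ p
        p = p - lr * grad
        p /= np.linalg.norm(p, axis=1, keepdims=True)  # back onto the sphere

    return p


if __name__ == "__main__":
    prototypes = uniform_prototypes(num_classes=4, dim=3)
    cosine = prototypes @ prototypes.T
    off_diagonal = cosine[~np.eye(4, dtype=bool)]
    print("largest pairwise cosine:", off_diagonal.max())
```

With four classes in three dimensions, the most uniform arrangement is a regular tetrahedron, whose pairwise cosine similarity is -1/3, so the largest off-diagonal value should approach that figure. In a framework like the one described, such fixed prototypes would serve as class-wise targets that features are pulled toward during training.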
