Few-shot image classification remains a critical challenge in computer vision, particularly in data-scarce settings. Existing methods typically rely on pre-trained vision-language models such as CLIP. However, owing to the modality gap, i.e., the inconsistent distributions of image and text features in the joint embedding space, directly using text features as class prototypes often yields suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method, which globally aligns image features with the text feature space through a linear transformation and refines their local spatial relationships with a triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that CMM simplifies the training process and is more efficient than competing methods. Furthermore, CMM improves average Top-1 accuracy by 1.06% across 11 benchmark datasets compared to methods that partially fine-tune the backbone, and it performs strongly on 4 distribution-shift datasets. Notably, CMM effectively mitigates the modality gap in pre-trained models, enabling text features to serve as effective class prototypes for image features, thus providing an efficient and highly generalizable solution for few-shot learning.
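The two ingredients of the method described above, a global linear map into the text feature space and a triplet loss that pulls mapped image features toward their class's text feature, can be sketched as follows. This is a minimal numpy illustration under stated assumptions (function names, the margin value, and the use of L2-normalized features with cosine-similarity classification are our own choices, not details taken from the paper):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length, as CLIP-style pipelines do."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def map_features(img_feats, W, b):
    """Global alignment: a single learned linear transform (W, b) carries
    image features into the text feature space."""
    return l2_normalize(img_feats @ W + b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Local refinement: each mapped image feature (anchor) is pulled toward
    its own class's text feature (positive) and pushed away from another
    class's text feature (negative). margin=0.2 is an illustrative value."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

def classify(img_feats, W, b, text_protos):
    """Text features act as class prototypes: predict the class whose text
    feature has the highest cosine similarity to the mapped image feature."""
    mapped = map_features(img_feats, W, b)
    sims = mapped @ l2_normalize(text_protos).T
    return sims.argmax(axis=-1)
```

In a full few-shot pipeline, `W` and `b` would be the only trained parameters (the CLIP backbone stays frozen), optimized on the support set with the triplet loss; at test time, `classify` scores queries against the fixed text prototypes.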