Noisy labels threaten the robustness of few-shot learning (FSL) because of the inexact features in a new domain. CLIP, a large-scale vision-language model, performs well in FSL based on image-text embedding similarities, but it is susceptible to misclassification caused by noisy labels. Enhancing the domain generalization of CLIP on noisy data within FSL tasks is therefore a critical challenge. In this paper, we provide a novel view to mitigate the influence of noisy labels: CLIP-based Robust Few-shot learning (CRoF). CRoF is a general plug-in module for CLIP-based models. To avoid misclassification and confused label embeddings, we design a few-shot task-oriented prompt generator that produces more discriminative descriptions for each category. The proposed prompts enlarge the inter-class distances of textual embeddings. Furthermore, rather than fully trusting CLIP's zero-shot classification, we fine-tune CLIP on noisy few-shot data in the new domain with a label-smoothing-style weighting strategy. The weights assigned to multiple potentially correct labels account for the relationship between CLIP's prior knowledge and the original label information, ensuring reliability. Our multiple-label loss function further supports robust training under this paradigm. Comprehensive experiments show that CRoF, as a plug-in, outperforms fine-tuned and vanilla CLIP models across different noise types and noise ratios.
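The label-smoothing-style weighting described above can be illustrated with a minimal sketch: a soft target distribution blends the (possibly noisy) one-hot label with CLIP's zero-shot prior, and a cross-entropy-style loss is taken against that soft target. The function names, the mixing coefficient `alpha`, and the exact blending form here are hypothetical simplifications, not the paper's actual formulation.

```python
import numpy as np

def soft_targets(clip_probs, noisy_label, num_classes, alpha=0.7):
    """Blend the given one-hot (possibly noisy) label with CLIP's
    zero-shot class probabilities, in the spirit of label smoothing.
    Hypothetical sketch; `alpha` trades trust in the annotation
    against trust in CLIP's prior knowledge."""
    one_hot = np.eye(num_classes)[noisy_label]
    return alpha * one_hot + (1.0 - alpha) * clip_probs

def multi_label_loss(logits, targets):
    """Cross-entropy between model logits and the soft multi-label
    targets produced by soft_targets (one row per sample)."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1).mean()
```

For example, with three classes, a noisy label of class 0, and a CLIP prior of `[0.6, 0.3, 0.1]`, the soft target keeps most mass on class 0 but retains non-zero weight on the alternatives CLIP considers plausible, which is what lets training recover from a mislabeled sample.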