The emergence of vision-language foundation models such as CLIP has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite this promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, which we name PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Although MAE is straightforward and recognized for its robustness, it is rarely used in noisy-label learning because of its slow convergence and poor performance outside prompt learning scenarios. To elucidate the robustness of PromptMAE, we leverage feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method that further improves robustness. PromptOT employs the text features of vision-language models as prototypes to construct an optimal transport matrix. This matrix effectively partitions the dataset into clean and noisy subsets, allowing cross-entropy loss to be applied to the clean subset and MAE loss to the noisy subset. Our noisy-label prompt learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
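The purification-and-split recipe above can be sketched end to end. The following is a minimal NumPy illustration, not the authors' implementation: the entropic Sinkhorn solver with uniform marginals, the cost choice (1 − cosine similarity to text prototypes), and the rule "a sample is clean if its OT assignment agrees with its given label" are all hedged assumptions inferred from the abstract.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class logits."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropic OT plan with uniform marginals over samples and classes
    (an illustrative assumption; the paper's exact marginals may differ)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    u = np.ones(n) / n
    for _ in range(n_iters):
        v = (np.ones(m) / m) / (K.T @ u)
        u = (np.ones(n) / n) / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan, sums to 1

def nlprompt_style_loss(image_feats, text_protos, labels):
    """CE on the OT-identified clean subset, MAE on the noisy subset."""
    logits = image_feats @ text_protos.T          # similarity to prototypes
    probs = softmax(logits)
    plan = sinkhorn(1.0 - logits)                 # cost = 1 - similarity
    ot_pred = plan.argmax(axis=1)                 # OT pseudo-label per sample
    clean = ot_pred == labels                     # agreement -> "clean"
    onehot = np.eye(text_protos.shape[0])[labels]
    ce = -np.log(probs[clean, labels[clean]] + 1e-12).sum()
    mae = np.abs(probs[~clean] - onehot[~clean]).sum()
    return (ce + mae) / len(labels)
```

In this sketch, MAE's bounded per-sample gradient is what limits the influence of mislabeled examples, while CE keeps fast convergence on the subset the transport plan deems clean.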