NLPrompt: Noise-Label Prompt Learning for Vision-Language Models

The emergence of vision-language foundation models, such as CLIP, has revolutionized image-text representation, enabling a broad range of applications via prompt learning. Despite this promise, real-world datasets often contain noisy labels that can degrade prompt learning performance. In this paper, we demonstrate that using mean absolute error (MAE) loss in prompt learning, which we call PromptMAE, significantly enhances robustness against noisy labels while maintaining high accuracy. Although MAE is simple and recognized for its robustness, it is rarely used in noisy-label learning because of its slow convergence and poor performance outside prompt learning scenarios. To explain the robustness of PromptMAE, we use feature learning theory to show that MAE can suppress the influence of noisy samples, thereby improving the signal-to-noise ratio and enhancing overall robustness. Additionally, we introduce PromptOT, a prompt-based optimal transport data purification method that further enhances robustness. PromptOT uses the text features of vision-language models as prototypes to construct an optimal transport matrix. This matrix partitions the dataset into clean and noisy subsets, so that cross-entropy loss can be applied to the clean subset and MAE loss to the noisy subset. Our Noise-Label Prompt Learning method, named NLPrompt, offers a simple and efficient approach that leverages the expressive representations and precise alignment capabilities of vision-language models for robust prompt learning. We validate NLPrompt through extensive experiments across various noise settings, demonstrating significant performance improvements.
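
The pipeline described above (text features as prototypes, an optimal transport matrix that splits the data, then cross-entropy on the clean part and MAE on the noisy part) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the Sinkhorn solver with uniform marginals, the cost defined as the negative log of the softmax over scaled similarities, the rule "clean if the transport-plan pseudo-label matches the given label", and the `scale` and `eps` values are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Per-sample cross-entropy against integer labels."""
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12)

def mae(probs, labels):
    """Per-sample MAE between probs and one-hot labels.

    For a probability vector p and one-hot y, sum|p - y| = 2 * (1 - p_y),
    so the loss is bounded in [0, 2], unlike cross-entropy.
    """
    return 2.0 * (1.0 - probs[np.arange(len(labels)), labels])

def sinkhorn(cost, eps=1.0, n_iters=200):
    """Entropic optimal transport plan for a (classes x samples) cost.

    Assumption: uniform marginals over classes and over samples.
    """
    n_classes, n_samples = cost.shape
    kernel = np.exp(-cost / eps)
    r = np.full(n_classes, 1.0 / n_classes)
    c = np.full(n_samples, 1.0 / n_samples)
    u = np.ones(n_classes)
    for _ in range(n_iters):
        v = c / (kernel.T @ u)
        u = r / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def nlprompt_style_loss(image_feats, text_feats, labels, scale=10.0):
    """Hypothetical helper: split samples with OT, then mix CE and MAE.

    image_feats: (N, D) L2-normalized image features.
    text_feats:  (C, D) L2-normalized text features used as prototypes.
    labels:      (N,) possibly noisy integer labels.
    """
    sim = scale * (text_feats @ image_feats.T)   # (C, N) scaled similarities
    probs = softmax(sim.T)                       # (N, C) class probabilities
    cost = -np.log(probs.T + 1e-12)              # (C, N) assumed OT cost
    plan = sinkhorn(cost)
    pseudo = plan.argmax(axis=0)                 # pseudo-label per sample
    clean = pseudo == labels                     # agreement marks "clean"
    per_sample = np.where(clean,
                          cross_entropy(probs, labels),
                          mae(probs, labels))
    return per_sample.mean(), clean, pseudo
```

In a real prompt-learning setup, `text_feats` would come from the text encoder applied to the learnable prompts and the loss would be backpropagated into the prompt parameters; the sketch only shows the partition and the choice of loss on fixed features.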
