Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Prompt learning is a parameter-efficient way to adapt vision-language models (VLMs), yet its robustness under label noise remains underexplored. Learnable prompts are highly susceptible to label noise, whereas visual content carries richer and more reliable semantic information that is less affected by it. Motivated by this observation, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we use a cross-modal attention mechanism to inject visual semantics back into the prompt representations, so that prompt tokens selectively aggregate the visual information relevant to the current sample. Anchoring prompt learning to stable instance-level visual evidence improves robustness and reduces the influence of noisy supervision. Because visual cues vary in quality across samples, injecting visual information in the same way for every sample can cause instability; we therefore introduce a lightweight conditional modulation mechanism that adaptively controls the strength of visual injection, striking a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise on seven benchmark datasets show that VisPrompt generally outperforms existing baselines and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
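
The two ideas in the abstract, prompt tokens attending over visual features and a gate that scales how much visual context is injected, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function name `vision_guided_prompts`, the scalar sigmoid gate with parameters `gate_w` and `gate_b`, the residual update, and the absence of learned query/key/value projections are all assumptions made for illustration. The released code at the URL above is the authoritative reference.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vision_guided_prompts(prompts, patches, gate_w, gate_b):
    """Inject visual context into prompt tokens (illustrative sketch).

    prompts: list of d-dim prompt token vectors, used as attention queries.
    patches: list of d-dim visual features, used as keys and values.
    gate_w, gate_b: hypothetical parameters of a scalar gate (an assumption,
        standing in for the paper's conditional modulation mechanism).
    """
    d = len(prompts[0])
    updated = []
    for p in prompts:
        # Scaled dot-product attention of this prompt token over the patches.
        weights = softmax([dot(p, v) / math.sqrt(d) for v in patches])
        # Visual context: attention-weighted sum of the patch features.
        ctx = [sum(w * v[i] for w, v in zip(weights, patches)) for i in range(d)]
        # Gate in (0, 1), conditioned on the visual context, sets injection strength.
        g = sigmoid(dot(gate_w, ctx) + gate_b)
        # Residual update: original prompt plus gated visual context.
        updated.append([pi + g * ci for pi, ci in zip(p, ctx)])
    return updated
```

In this sketch, a strongly negative `gate_b` drives the gate toward zero and leaves the prompts essentially unchanged, while a strongly positive one injects the full visual context. That is one simple way to realize per-sample control over injection strength.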
