Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, because mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while preserving informative updates. We provide theoretical analysis and empirical evidence for how this mechanism achieves adaptive suppression. This design transforms ``gradient vanishing'', traditionally a training bottleneck, into a principled noise-filtering shield for prompt tuning under label noise. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across multiple label-noise benchmarks, outperforming methods that rely on complex architectures and handcrafted hyperparameters.
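The saturation effect described above can be illustrated with a small numerical sketch. The abstract does not give the exact DSPT formulation, so the code below assumes the simplest reading of "double softmax": class probabilities p = softmax(z) are normalized a second time, q = softmax(p), and the loss is the negative log of q at the given label. The function names (`ce_grad`, `double_softmax_grad`) and the toy logits are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(logits, label):
    """Gradient of standard cross-entropy, -log softmax(z)[label], w.r.t. z."""
    p = softmax(logits)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]


def double_softmax_grad(logits, label):
    """Gradient of -log softmax(softmax(z))[label] w.r.t. z.

    Assumed form of the double-softmax loss; not taken from the paper.
    """
    p = softmax(logits)   # first normalization
    q = softmax(p)        # second normalization, applied to probabilities
    # Gradient of the loss with respect to p (same form as ordinary CE).
    g = [q_i - (1.0 if i == label else 0.0) for i, q_i in enumerate(q)]
    # Chain rule through the first softmax: J^T g with J = diag(p) - p p^T.
    dot = sum(p_i * g_i for p_i, g_i in zip(p, g))
    return [p_i * (g_i - dot) for p_i, g_i in zip(p, g)]


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


# A confidently "wrong" sample: the model strongly prefers class 0,
# but the (possibly noisy) label says class 1.
confident_wrong = [10.0, 0.0, 0.0]
# An ambiguous sample: the model is only mildly in favor of class 0.
ambiguous = [0.5, 0.0, 0.0]
label = 1

for name, z in [("confident_wrong", confident_wrong), ("ambiguous", ambiguous)]:
    print(name,
          "CE grad norm:", l2_norm(ce_grad(z, label)),
          "double-softmax grad norm:", l2_norm(double_softmax_grad(z, label)))
```

Under this assumed form, the Jacobian of the first softmax shrinks toward zero when p is nearly one-hot, so a sample that the model confidently disagrees with contributes a gradient several orders of magnitude smaller than under standard cross-entropy, while the ambiguous sample still produces a clearly non-zero update. No threshold or temperature is tuned to obtain this behavior, which is consistent with the hyperparameter-free claim, although the actual method may differ in details such as scaling.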