Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recent unsupervised prompt tuning methods, such as UPL and POUF, directly use pseudo-labels as supervision to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels can easily mislead the tuning process and degrade the model's representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities of the VLM and enhances them through a residual connection to similarity-based prediction probabilities, in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to build a reliable Feature Cache Model (FCM) for training-free inference. We then design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to compute the distance between each test image and each cached sample; this distance serves as the weight of the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves surprisingly strong performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we further propose a training-based approach (TFUP-T) to boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts a marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared with unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
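The cache-based inference and the marginal entropy term described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the temperature, the exponential affinity `exp(-beta * (1 - sim))` (borrowed from Tip-Adapter-style cache models), and the weights `alpha` and `beta` are all assumptions made for illustration. Only a feature-level cosine similarity is used; the semantic-level part of MSM and the sample selection step are not specified in the abstract and are omitted here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_feat, text_feats, temperature=100.0):
    """CLIP-style zero-shot class probabilities (temperature is an assumed value)."""
    img = l2_normalize(image_feat)
    logits = [temperature * dot(img, l2_normalize(t)) for t in text_feats]
    return softmax(logits)

def cache_probs(image_feat, cache_feats, cache_labels, num_classes, beta=5.0):
    """Similarity-weighted vote over cached pseudo-labels.

    Assumption: feature-level cosine similarity only, turned into a weight
    with an exponential affinity; this stands in for the paper's MSM.
    """
    img = l2_normalize(image_feat)
    scores = [0.0] * num_classes
    for feat, label in zip(cache_feats, cache_labels):
        sim = dot(img, l2_normalize(feat))
        scores[label] += math.exp(-beta * (1.0 - sim))
    total = sum(scores)
    return [s / total for s in scores]

def predict(image_feat, text_feats, cache_feats, cache_labels, alpha=1.0):
    """Residual combination: zero-shot probabilities plus weighted cache probabilities."""
    zs = zero_shot_probs(image_feat, text_feats)
    cp = cache_probs(image_feat, cache_feats, cache_labels, len(text_feats))
    combined = [z + alpha * c for z, c in zip(zs, cp)]
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined

def marginal_entropy(batch_probs):
    """Entropy of the batch-averaged prediction distribution.

    How this term is signed and weighted in the TFUP-T loss is not stated
    in the abstract, so only the quantity itself is computed here.
    """
    n = len(batch_probs)
    k = len(batch_probs[0])
    mean = [sum(p[c] for p in batch_probs) / n for c in range(k)]
    return -sum(m * math.log(m) for m in mean if m > 0.0)

# Toy usage with hypothetical 2-D features and two classes.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
cache_feats = [[1.0, 0.1], [0.1, 1.0]]
cache_labels = [0, 1]
pred, scores = predict([0.9, 0.2], text_feats, cache_feats, cache_labels)
```

Because the cache term is added to the zero-shot prediction rather than replacing it, the pre-trained model's output is kept intact and the cache only adjusts it, which matches the abstract's stated aim of preserving the inherent representation capabilities.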