Training-Free Unsupervised Prompt for Vision-Language Models

Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and degrade the model's representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which preserves the inherent representation capabilities as far as possible and enhances them through a residual connection to similarity-based prediction probabilities, in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to build a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to compute the distance between each test image and each cached sample; this distance serves as the weight of the corresponding cached label when generating similarity-based prediction probabilities. In this way, TFUP achieves strong performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts a marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves on the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
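
As a rough illustration of the pipeline described above, the following NumPy sketch selects cache samples by combining instance confidence with a prototype score, measures feature-level and semantic-level similarity to each cached sample, and adds the resulting cache prediction to the zero-shot logits as a residual. The abstract does not give the formulas, so every concrete choice here is an assumption: the function names, the equal weighting of the two scores and of the two similarity levels, the exponential distance-to-weight mapping, and the hyperparameters `per_class`, `alpha`, `beta`, and `temp` are all hypothetical. Features are assumed to be L2-normalized image and text embeddings from a VLM; `marginal_entropy` only computes the entropy of the batch-averaged prediction, and how it enters the TFUP-T objective is not specified here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_cache(feats, text_feats, per_class=2, temp=100.0):
    """Pick representative unlabeled samples per pseudo-class (assumed scoring)."""
    probs = softmax(temp * feats @ text_feats.T)  # zero-shot probabilities
    pseudo = probs.argmax(axis=1)                 # pseudo-labels
    conf = probs.max(axis=1)                      # instance confidence
    num_classes = text_feats.shape[0]
    keys, labels = [], []
    for c in range(num_classes):
        idx = np.flatnonzero(pseudo == c)
        if idx.size == 0:
            continue
        proto = l2_normalize(feats[idx].mean(axis=0))  # class prototype
        score = conf[idx] + feats[idx] @ proto         # assumption: equal weighting
        top = idx[np.argsort(-score)[:per_class]]
        keys.append(feats[top])
        labels.append(np.full(top.size, c))
    keys = np.concatenate(keys)
    values = np.eye(num_classes)[np.concatenate(labels)]  # one-hot cached labels
    return keys, values

def cache_predict(x, text_feats, keys, values, alpha=1.0, beta=5.0, temp=100.0):
    """Zero-shot logits plus a residual cache term (assumed similarity forms)."""
    zero_shot = temp * x @ text_feats.T
    feat_sim = x @ keys.T                              # feature-level: cosine similarity
    p_x = softmax(zero_shot)
    p_k = softmax(temp * keys @ text_feats.T)
    sem_sim = p_x @ p_k.T                              # semantic-level: prediction agreement
    dist = 1.0 - 0.5 * (feat_sim + sem_sim)            # assumption: simple average
    weights = np.exp(-beta * dist)                     # smaller distance, larger weight
    return zero_shot + alpha * (weights @ values)      # residual connection

def marginal_entropy(probs, eps=1e-12):
    """Entropy of the prediction averaged over a batch."""
    mean_p = probs.mean(axis=0)
    return float(-(mean_p * np.log(mean_p + eps)).sum())

# Toy usage with synthetic, well-separated features (not real VLM embeddings).
rng = np.random.default_rng(0)
text_feats = np.eye(3, 8)                              # 3 classes, 8-dim embeddings
true_labels = np.repeat(np.arange(3), 4)
feats = l2_normalize(text_feats[true_labels] + 0.05 * rng.normal(size=(12, 8)))
keys, values = build_cache(feats, text_feats, per_class=2)
logits = cache_predict(feats, text_feats, keys, values)
predictions = logits.argmax(axis=1)
```

Because the cache term is added to the zero-shot logits rather than replacing them, setting `alpha=0` in this sketch recovers plain zero-shot classification, which is one way to read the claim that the inherent representation is preserved.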
