Recent advances in large-scale vision-language models have achieved impressive performance on various zero-shot image classification tasks. While prior studies have demonstrated significant improvements by introducing few-shot labelled target samples, they still require labelling of target samples, which greatly degrades their scalability and generalizability when handling diverse visual recognition tasks. We design NtUA, a Noise-tolerant Unsupervised Adapter that enables the learning of effective target models with few unlabelled target samples. NtUA works as a key-value cache that formulates the visual features and predicted pseudo-labels of the few unlabelled target samples as key-value pairs. It consists of two complementary designs. The first is adaptive cache formation, which combats pseudo-label noise by weighting the key-value pairs according to their prediction confidence. The second is knowledge-guided cache refinement, which refines pair values (i.e., pseudo-labels) and cache weights by leveraging knowledge distillation from large-scale vision-language models. Extensive experiments show that NtUA achieves superior performance consistently across multiple widely adopted benchmarks.
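The cache mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the one-hot value encoding, the confidence weighting, and the exponential affinity sharpening (with a hypothetical `beta` parameter, in the style of cache-based adapters such as Tip-Adapter) are assumptions, since the abstract does not give the exact formulation.

```python
import numpy as np

def build_cache(features, pseudo_labels, confidences, num_classes):
    """Adaptive cache formation (sketch): keys are L2-normalised visual
    features; values are one-hot pseudo-labels weighted by each sample's
    prediction confidence, so low-confidence (likely noisy) pairs
    contribute less at retrieval time."""
    keys = features / np.linalg.norm(features, axis=1, keepdims=True)
    values = np.eye(num_classes)[pseudo_labels] * confidences[:, None]
    return keys, values

def cache_logits(query, keys, values, beta=5.5):
    """Cache retrieval (sketch): cosine affinity between the query feature
    and cached keys, sharpened exponentially, then used to take a
    weighted sum over the cached values."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    affinity = q @ keys.T                      # cosine similarity to each key
    weights = np.exp(-beta * (1.0 - affinity))  # sharpen: near-1 affinity dominates
    return weights @ values                     # class logits from the cache

# Toy usage: four orthogonal "features", one cached sample per class.
keys, values = build_cache(np.eye(4), np.array([0, 1, 2, 3]),
                           np.ones(4), num_classes=4)
logits = cache_logits(np.eye(4)[:1], keys, values)  # query matches class 0's key
```

Knowledge-guided cache refinement would then update `values` and the per-pair weights toward the vision-language model's distilled predictions; that step is omitted here for brevity.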