Contrastive vision-language pre-training (CLIP) shows remarkable potential for perceiving open-world visual concepts, enabling effective zero-shot image recognition. However, CLIP-based few-shot learning methods typically require offline fine-tuning of parameters on the few-shot samples, which lengthens inference time and risks over-fitting in certain domains. To address these challenges, we propose Meta-Adapter, a lightweight residual-style adapter that refines CLIP features online, guided by the few-shot samples. Trained on only a few samples, our method enables effective few-shot learning and generalizes to unseen data or tasks without additional fine-tuning, achieving competitive performance with high efficiency. Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets while running inference faster. Furthermore, our model is simple and flexible, serving as a plug-and-play module directly applicable to downstream tasks. Without further fine-tuning, Meta-Adapter obtains notable performance improvements on open-vocabulary object detection and segmentation tasks.
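To make the idea of a residual-style adapter that refines CLIP features from few-shot samples more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the attention pooling over support features, the fixed scalar `gate`, the `temperature`, and all function names here are illustrative assumptions. In the actual method the adapter parameters are presumably learned, whereas this sketch has no trainable weights.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_class_embeddings(text_emb, support_emb, gate=0.5, temperature=0.1):
    """Hypothetical residual refinement of class (text) embeddings.

    text_emb:    (C, D) one embedding per class, e.g. from a CLIP text encoder.
    support_emb: (C, K, D) K few-shot image embeddings per class.
    gate:        assumed fixed residual weight (a learned gate is more likely
                 in practice); gate=0 recovers the zero-shot embeddings.
    Returns:     (C, D) refined, unit-norm class embeddings.
    """
    text_emb = l2_normalize(text_emb)
    support_emb = l2_normalize(support_emb)

    # Each class embedding attends over its own K support features.
    scores = np.einsum("cd,ckd->ck", text_emb, support_emb) / temperature
    weights = softmax(scores, axis=-1)
    context = np.einsum("ck,ckd->cd", weights, support_emb)

    # Residual connection: original embedding plus gated few-shot context.
    return l2_normalize(text_emb + gate * context)


def classify(image_emb, class_emb):
    """Assign each image embedding (N, D) to its most similar class."""
    logits = l2_normalize(image_emb) @ class_emb.T
    return logits.argmax(axis=-1)
```

Because the refinement is a forward pass over the support features rather than a gradient-based update, new few-shot samples can be swapped in at inference time without fine-tuning, which is the sense in which such an adapter operates online.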