We propose a novel framework for few-shot learning by leveraging large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce Proto-CLIP, which utilizes image prototypes and text prototypes for few-shot learning. Specifically, Proto-CLIP adapts the image and text encoder embeddings from CLIP in a joint fashion using few-shot examples. The embeddings from the two encoders are used to compute the respective prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. Such alignment benefits few-shot classification because both types of prototypes reinforce each other's contributions. Proto-CLIP has both training-free and fine-tuned variants. We demonstrate the effectiveness of our method through experiments on benchmark datasets for few-shot learning, as well as in real-world robot perception. The project page is available at https://irvlutd.github.io/Proto-CLIP
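The prototype computation and alignment described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, prototypes are taken as L2-normalized per-class means of the encoder embeddings, and the alignment objective is approximated here as one minus the cosine similarity between matched image and text prototypes.

```python
import numpy as np

def class_prototypes(embeddings, labels, num_classes):
    """Average embeddings per class to form prototypes (hypothetical helper).

    embeddings: (N, D) array of encoder outputs for N few-shot examples.
    labels:     (N,) integer class labels.
    """
    protos = np.zeros((num_classes, embeddings.shape[1]))
    for c in range(num_classes):
        protos[c] = embeddings[labels == c].mean(axis=0)
    # L2-normalize so classification reduces to cosine similarity
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def alignment_loss(img_protos, txt_protos):
    """Assumed alignment objective: mean (1 - cosine similarity) over
    matched image/text prototype pairs of the same class."""
    return float(np.mean(1.0 - np.sum(img_protos * txt_protos, axis=1)))
```

With prototypes in hand, a query image would be assigned to the class whose (image or text) prototype has the highest cosine similarity to its embedding, while the fine-tuned variant would additionally minimize the alignment loss during adaptation.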