We propose a novel framework for few-shot learning that leverages large-scale vision-language models such as CLIP. Motivated by unimodal prototypical networks for few-shot learning, we introduce PROTO-CLIP, which uses both image prototypes and text prototypes. Specifically, PROTO-CLIP jointly adapts the image encoder and the text encoder of CLIP using few-shot examples. The two encoders are used to compute prototypes of image classes for classification. During adaptation, we propose aligning the image and text prototypes of corresponding classes. This alignment benefits few-shot classification because both types of prototypes contribute. We demonstrate the effectiveness of our method through experiments on few-shot learning benchmark datasets and in real-world robot perception.
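As a rough illustration of the idea of classifying with image and text prototypes and aligning them per class, the following is a minimal NumPy sketch, not the PROTO-CLIP implementation. Everything specific in it is an assumption made for illustration: the function names, the mixing weight `alpha`, the sharpness `beta`, the use of mean features as prototypes, and the symmetric cross-entropy form of the alignment loss. Plain arrays stand in for the outputs of the CLIP image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def class_prototypes(features, labels, num_classes):
    """Prototype per class: mean of the normalized support features (assumed form)."""
    features = l2_normalize(features)
    protos = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
    return l2_normalize(protos)

def classify(query, image_protos, text_protos, alpha=0.5, beta=10.0):
    """Mix class probabilities from both prototype types (alpha, beta are hypothetical)."""
    q = l2_normalize(query)
    p_img = softmax(beta * q @ image_protos.T)
    p_txt = softmax(beta * q @ text_protos.T)
    return alpha * p_img + (1.0 - alpha) * p_txt

def alignment_loss(image_protos, text_protos, beta=10.0):
    """Symmetric cross-entropy that pulls same-class image/text prototypes together
    (an assumed contrastive form; the abstract does not specify the loss)."""
    logits = beta * image_protos @ text_protos.T
    idx = np.arange(logits.shape[0])
    loss_i2t = -np.log(softmax(logits, axis=1)[idx, idx]).mean()
    loss_t2i = -np.log(softmax(logits, axis=0)[idx, idx]).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 2 classes, 2 support "image features" each, one "text feature" per class.
support = np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
                    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])
labels = np.array([0, 0, 1, 1])
image_protos = class_prototypes(support, labels, num_classes=2)
text_protos = l2_normalize(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
probs = classify(np.array([[0.9, 0.1, 0.0]]), image_protos, text_protos)
loss = alignment_loss(image_protos, text_protos)
```

In an actual adaptation loop the encoders (or adapters on top of them) would be trained so that a loss of this kind decreases alongside the classification loss; here the features are fixed, so the sketch only shows how the quantities are computed.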