Few-shot classification aims to adapt to new tasks from a limited number of labeled examples. To make full use of the available data, recent methods explore better similarity measures between query and support images and learn stronger high-dimensional features through meta-training and pre-training strategies. However, the potential of multi-modal information, which may bring promising improvements to few-shot classification, has barely been explored. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification, which exploits the complementarity of the vision and language modalities via two parallel branches. Concretely, to introduce the language modality into a visual task with limited samples, we use a pre-trained text encoder to extract class-level text features directly from class names, while images are processed by a conventional image encoder. A language-guided decoder then obtains the text features corresponding to each image by aligning the class-level features with the visual features. In addition, to take advantage of both class-level features and prototypes, we build a refined prototypical head that generates robust prototypes in the text branch for subsequent similarity measurement. Finally, we aggregate the visual and text logits to calibrate the deviation of a single modality. Extensive experiments on benchmark datasets demonstrate that LPN is competitive with state-of-the-art methods.
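The two-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the LPN implementation: the cosine similarity, the mean-based prototypes, the fusion weight `alpha`, and all toy feature values are assumptions chosen for illustration, and the language-guided decoder and refined prototypical head are not modelled (the per-image text features are simply given as inputs).

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_prototypes(support):
    """Map each class name to the mean of its support feature vectors."""
    return {name: mean_vector(feats) for name, feats in support.items()}

def class_logits(query, prototypes):
    """Score a query against every class prototype (assumed: cosine)."""
    return {name: cosine(query, proto) for name, proto in prototypes.items()}

def fuse(visual, text, alpha=0.5):
    """Assumed aggregation: convex combination of the two logit sets."""
    return {name: (1 - alpha) * visual[name] + alpha * text[name]
            for name in visual}

def predict(logits):
    """Return the class name with the highest logit."""
    return max(logits, key=logits.get)

# Hypothetical 2-way 2-shot episode with 2-D toy features.
visual_support = {"cat": [[1.0, 0.0], [0.8, 0.2]],
                  "dog": [[0.0, 1.0], [0.2, 0.8]]}
# Stand-ins for the per-image text features of the text branch.
text_support = {"cat": [[1.0, 0.0], [0.9, 0.1]],
                "dog": [[0.0, 1.0], [0.1, 0.9]]}

query_visual = [0.5, 0.6]   # visually ambiguous query
query_text = [0.9, 0.1]     # its (assumed) text-branch feature

visual_logits = class_logits(query_visual, build_prototypes(visual_support))
text_logits = class_logits(query_text, build_prototypes(text_support))
fused_logits = fuse(visual_logits, text_logits, alpha=0.5)

print(predict(visual_logits), predict(fused_logits))
```

In this toy episode the visual branch alone slightly favors the wrong class for the ambiguous query, and the text-branch logits shift the fused decision, which is the kind of single-modality deviation the aggregation step is meant to calibrate.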