Contrastive Language-Image Pre-training (CLIP) demonstrates impressive zero-shot capability. The key to improving the adaptation of CLIP to downstream tasks with few exemplars lies in how to effectively model and transfer the useful knowledge embedded in CLIP. Previous work typically mines this knowledge from the limited visual samples and closed-set semantics (i.e., semantics within the target category set of the downstream task). However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit few-shot learning but remain unexplored. In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relationships to image-target relationships to make predictions. Specifically, we adopt a transformer module which takes the visual feature as "Query", the text features of the anchors as "Key", and the similarity matrix between the text features of the anchor and target classes as "Value". In this way, the output of such a transformer module represents the relationship between the image and the target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of the input text embedding learnable. We conduct extensive experiments on eleven representative classification datasets. The results show that our method performs favorably against previous state-of-the-art methods in few-shot classification settings.
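The attention-based relation transition described above can be sketched as follows. This is a minimal illustration, not the paper's exact module: the function name `relation_transition`, the use of plain dot-product similarity, and the single-head, single-image formulation are all simplifying assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relation_transition(img_feat, anchor_text, target_text):
    """Sketch of the relation transition via attention (assumed simplified form).

    img_feat:    (d,)    CLIP visual feature            -> "Query"
    anchor_text: (A, d)  text features of open anchors  -> "Key"
    target_text: (C, d)  text features of target classes
    The "Value" is the anchor-target similarity matrix of shape (A, C).
    Returns logits of shape (C,): the transferred image-target relationship.
    """
    d = img_feat.shape[-1]
    value = anchor_text @ target_text.T                 # (A, C) anchor-target similarity
    scores = (img_feat @ anchor_text.T) / np.sqrt(d)    # (A,)   image-anchor relationship
    attn = softmax(scores)                              # attention weights over anchors
    logits = attn @ value                               # (C,)   image-target predictions
    return logits
```

The output is a weighted combination of anchor-target similarity rows, where the weights come from how strongly the image relates to each anchor, i.e., the image-anchor relationship is transitioned into an image-target relationship.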