Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish correlations between text and images and achieve remarkable success on various downstream tasks after fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We observe that this whole-image matching is not effective, since images from the same class often contain a set of different semantic objects, and each object in turn consists of a set of semantic parts or concepts. Individual semantic parts or concepts may also appear in image samples from different classes. To address this issue, we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that CCLI improves upon current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.
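To make the idea of learning visual concepts from text concepts more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that text-concept and image features have already been extracted by a CLIP-like encoder (random arrays stand in for them here), that a visual concept can be approximated by averaging the image features most similar to a text concept, and that an image can then be represented by its similarities to all visual concepts. The function names, the `top_k` parameter, and the averaging rule are all hypothetical choices made for illustration; the concept inference network itself is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def learn_visual_concepts(text_feats, image_feats, top_k=5):
    """Assumed rule: for each text concept, average the top_k most similar image features.

    text_feats:  (C, D) array, one row per text concept
    image_feats: (N, D) array, one row per image (or image region)
    returns:     (C, D) array of unit-norm visual concept vectors
    """
    t = l2_normalize(text_feats)
    v = l2_normalize(image_feats)
    sim = t @ v.T                                   # (C, N) cosine similarities
    top_idx = np.argsort(-sim, axis=1)[:, :top_k]   # (C, top_k) best-matching images
    concepts = v[top_idx].mean(axis=1)              # (C, top_k, D) -> (C, D)
    return l2_normalize(concepts)

def concept_representation(image_feats, visual_concepts):
    """Describe each image by its cosine similarity to every visual concept: (N, C)."""
    return l2_normalize(image_feats) @ visual_concepts.T

# Random stand-ins for encoder outputs (assumption: real features would come from CLIP).
rng = np.random.default_rng(0)
D, C, N = 16, 8, 40
text_feats = rng.normal(size=(C, D))
image_feats = rng.normal(size=(N, D))

visual_concepts = learn_visual_concepts(text_feats, image_feats, top_k=5)
reps = concept_representation(image_feats, visual_concepts)
print(visual_concepts.shape, reps.shape)
```

In a pipeline of this shape, the resulting concept-level representation `reps` would be the input to a small classifier trained on the few labeled samples available, which is the role the abstract assigns to the concept inference network.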