Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map both modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks like image classification. However, when there are only a few examples per category, the potential of large vision-language models is often not fully realized, mainly due to the mismatch between the large number of parameters and the relatively small amount of training data. This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head. With the proposed category name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).
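As a rough illustration of the idea of initializing a classification head from category names, the following minimal sketch assumes that each row of the head's weight matrix is set to the normalized text embedding of one category name, and that logits are cosine similarities between image features and those rows. The abstract does not specify these details, so this is an assumption rather than the paper's implementation. The function `toy_text_encoder` is a hypothetical stand-in for a pretrained text encoder, and the image features are synthetic.

```python
import numpy as np


def toy_text_encoder(names, dim=16, seed=0):
    """Hypothetical stand-in for a pretrained text encoder.

    Returns one unit-norm random vector per category name; a real
    vision-language model would return learned text embeddings instead.
    """
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(names), dim))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def init_head_from_names(names, text_encoder):
    """Assumed scheme: head weights start as the category-name embeddings."""
    return text_encoder(names).copy()


def classify(image_features, head):
    """Predict the class whose head row has the highest cosine similarity."""
    feats = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    logits = feats @ head.T
    return logits.argmax(axis=1)


names = ["cat", "dog", "car"]
head = init_head_from_names(names, toy_text_encoder)

# Synthetic image features placed close to the "dog" and "car" embeddings,
# mimicking a shared image-text embedding space.
rng = np.random.default_rng(1)
image_features = head[[1, 2]] + 0.01 * rng.normal(size=(2, head.shape[1]))
pred = classify(image_features, head)
```

Under this assumption, the head behaves like a zero-shot classifier before any few-shot training, and fine-tuning on the few labeled examples would then start from these weights rather than from a random initialization.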