Vision-language models (VLMs) such as CLIP are trained to align paired text and images. Recent work on CLIP-based few-shot image classification has observed that, alongside text embeddings, image embeddings from the training set are an important source of information. In this work, we investigate the effect of directly mixing image and text prototypes for few-shot classification and analyze it from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add noise in the form of instance-specific background or context information. To capture only the image-space information relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space, which yields a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets on which CLIP's cross-modal alignment is poor, semantic alignment may be suboptimal. We show that the image subspace can still be leveraged in this case by modeling its anisotropy with class covariances. We demonstrate that combining a text-aligned mixed-prototype classifier with an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
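As a rough illustration of the prototype-mixing and text-alignment steps described above, the following NumPy sketch may help. It is not the authors' implementation. The function names, the mixing weight `alpha`, the subspace size `k`, the convex-combination form of the mixture, and the use of an uncentered SVD to obtain the principal directions are all assumptions made for this example. The image-specific LDA classifier is not shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit L2 norm along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def image_prototypes(feats, labels, num_classes):
    """Per-class mean of the few-shot image embeddings, shape (C, d)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])


def text_principal_directions(text_emb, k):
    """Top-k right singular vectors of the (C, d) text embedding matrix.

    Assumption: the SVD is taken without centering; the abstract does not
    say how the principal directions are computed.
    """
    _, _, vt = np.linalg.svd(text_emb, full_matrices=False)
    return vt[:k]  # (k, d), rows are orthonormal


def project_onto(x, basis):
    """Orthogonal projection of the rows of x onto the span of `basis`."""
    return x @ basis.T @ basis


def mix_prototypes(text_emb, img_proto, alpha):
    """Hypothetical convex mixture: alpha weights text, (1 - alpha) weights image."""
    return l2_normalize(alpha * text_emb + (1.0 - alpha) * img_proto)


def classify(queries, prototypes):
    """Predict the class whose prototype has the highest cosine similarity."""
    return np.argmax(l2_normalize(queries) @ prototypes.T, axis=1)
```

A typical call sequence under these assumptions would be `image_prototypes`, then `text_principal_directions` and `project_onto` to obtain text-aligned image prototypes, then `mix_prototypes` and `classify`. With `alpha = 1.0` the mixture reduces to the zero-shot text classifier, and decreasing `alpha` moves the prototypes toward the image means, which is the sense in which the text embedding plays the role of a shrinkage target in this sketch.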