Image recognition has recently witnessed a paradigm shift, in which vision-language models are used to perform few-shot classification based on textual prompts. Among these, the CLIP model has shown remarkable capabilities for zero-shot transfer by matching an image and a custom textual prompt in its latent space. This has paved the way for several works that focus on engineering or learning textual contexts to maximize CLIP's classification capabilities. In this paper, we follow this trend by learning an ensemble of prompts for image classification. We show that learning diverse and possibly shorter contexts improves results considerably and consistently compared with relying on a single trainable prompt. In particular, we report better few-shot capabilities with no additional cost at inference time. We demonstrate the capabilities of our approach on 11 different benchmarks.
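To illustrate why a prompt ensemble need not cost more at inference time, the following minimal sketch averages the text embeddings of several prompts per class once, ahead of time, so that each image is still compared against a single vector per class. This is an assumption about how such an ensemble could be collapsed, not a description of the method in this paper: random arrays stand in for the outputs of CLIP's text and image encoders, and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def ensemble_class_embeddings(prompt_embeddings):
    """Collapse an ensemble of prompts into one embedding per class.

    prompt_embeddings: array of shape (num_prompts, num_classes, dim),
    a stand-in for text-encoder outputs (assumption, not real CLIP output).
    Returns an array of shape (num_classes, dim) with unit-norm rows.
    """
    normalized = l2_normalize(prompt_embeddings)
    # Average over the prompt axis, then renormalize the mean.
    return l2_normalize(normalized.mean(axis=0))


def classify(image_embeddings, class_embeddings):
    """Pick, for each image, the class with the highest cosine similarity."""
    similarities = l2_normalize(image_embeddings) @ class_embeddings.T
    return similarities.argmax(axis=1)


# Hypothetical sizes; random data replaces the real encoders.
rng = np.random.default_rng(0)
num_prompts, num_classes, dim = 4, 3, 8
prompt_embeddings = rng.normal(size=(num_prompts, num_classes, dim))

# Computed once, before inference.
class_embeddings = ensemble_class_embeddings(prompt_embeddings)

# Synthetic "images": each lies close to one class embedding.
image_embeddings = class_embeddings + 0.01 * rng.normal(size=(num_classes, dim))
predictions = classify(image_embeddings, class_embeddings)
```

In this sketch the per-image work is one matrix product against `num_classes` vectors, regardless of how many prompts were in the ensemble.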