Modern image classification relies on large discriminative networks that predict class labels directly, which makes it difficult to assess the intuitive visual ``features'' that may underlie a classification decision. At the same time, recent work on joint vision-language models such as CLIP provides ways to specify natural language descriptions of image classes, but typically focuses on a single description per class. In this work, we demonstrate that an alternative approach, arguably more akin to our understanding of multiple ``visual features'' per class, can also provide compelling performance in the robust few-shot learning setting. In particular, we automatically enumerate multiple visual descriptions of each class via a large language model (LLM), then use a vision-language model to translate these descriptions into a set of visual features for each image; finally, we use sparse logistic regression to select a relevant subset of these features to classify each image. This approach both provides an ``intuitive'' set of relevant features for each class and, in the few-shot learning setting, outperforms standard approaches such as linear probing. We also show that, when combined with finetuning, the method outperforms existing state-of-the-art finetuning approaches in both in-distribution and out-of-distribution performance.
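The three-step pipeline described above (descriptions, per-description image features, sparse selection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: random unit vectors stand in for the text embeddings of LLM-written descriptions and for image embeddings from a CLIP-like encoder, and the L1-regularized softmax regression is fit with a simple proximal gradient loop, which is not necessarily the solver the authors use. All names (`description_features`, `fit_sparse_logreg`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical stand-ins for text embeddings of LLM-written descriptions.
n_classes, n_desc_per_class, dim = 3, 5, 32
n_desc = n_classes * n_desc_per_class
text_emb = normalize(rng.normal(size=(n_desc, dim)))
desc_class = np.repeat(np.arange(n_classes), n_desc_per_class)
centers = np.stack([text_emb[desc_class == c].mean(axis=0)
                    for c in range(n_classes)])

def sample_images(n_per_class):
    """Synthetic image embeddings: class center plus Gaussian noise."""
    labels = np.repeat(np.arange(n_classes), n_per_class)
    imgs = centers[labels] + 0.05 * rng.normal(size=(labels.size, dim))
    return normalize(imgs), labels

def description_features(image_emb):
    """One feature per description: cosine similarity with the image."""
    return image_emb @ text_emb.T

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_sparse_logreg(x, y, lam=0.01, lr=0.5, steps=2000):
    """L1-regularized softmax regression via proximal gradient descent."""
    n, d = x.shape
    onehot = np.eye(n_classes)[y]
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    for _ in range(steps):
        residual = softmax(x @ w + b) - onehot
        w = w - lr * (x.T @ residual) / n
        b = b - lr * residual.mean(axis=0)
        # Soft-thresholding is the proximal step for the L1 penalty;
        # it sets small weights exactly to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b

# Few-shot training set (8 images per class) and a larger test set.
train_x, train_y = sample_images(8)
test_x, test_y = sample_images(50)

w, b = fit_sparse_logreg(description_features(train_x), train_y)
pred = np.argmax(description_features(test_x) @ w + b, axis=1)
accuracy = float(np.mean(pred == test_y))

# Descriptions that keep a nonzero weight for at least one class.
selected = np.flatnonzero(np.any(w != 0, axis=1))
print(f"accuracy={accuracy:.2f}, kept {selected.size} of {n_desc} descriptions")
```

In a real setting the indices in `selected` would map back to human-readable description strings, which is what makes the chosen features inspectable; the penalty strength `lam` trades off how many descriptions survive against fit quality.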