Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand why, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between how frequently a class is seen during VLM training and instruction tuning and the VLM's performance on that class; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance transfers to the VLM's general capabilities, yielding an improvement of 11.8% on the newly collected ImageWikiQA dataset.