Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks spanning visual reasoning, document understanding, and multimodal dialogue. These gains hold across VLMs built on diverse base models, alignment architectures, and training data. However, recent work shows that these models lag behind on traditional image classification benchmarks, which test fine-grained visual knowledge. We evaluate a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors behind the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.