This paper introduces benchmarks for evaluating Vision-Language Models (VLMs) on real-world zero-shot recognition tasks, focusing on the granularity and specificity of the prompting text. We propose an evaluation protocol built on adapted ImageNet and MS-COCO datasets that measures how consistently models recognize concepts at different granularity levels and how sensitive they are to the specificity of language inputs. Our extensive evaluation shows that state-of-the-art VLMs, including contrastive models such as CLIP, struggle with granularity and are sensitive to text specificity, which limits their effectiveness in open-world settings. This study, the first to evaluate VLMs from these perspectives, provides insights and tools for the community, highlights the limitations of current models, and points the way toward models that generalize better in zero-shot recognition.