This paper investigates the challenges of applying vision-language models (VLMs) to zero-shot visual recognition tasks in an open-world setting, with a focus on contrastive vision-language models such as CLIP. We first examine the performance of VLMs on concepts at different levels of granularity. We propose a way to fairly evaluate the performance discrepancy under two experimental setups and find that VLMs are better at recognizing fine-grained concepts. Furthermore, we find that the similarity scores produced by VLMs do not strictly reflect whether a textual input is correct for a given visual input. We propose an evaluation protocol to test our hypothesis that the scores can be biased towards more informative descriptions, and that the nature of the similarity score between embeddings makes it challenging for VLMs to distinguish correct descriptions from similar but wrong ones. Our study highlights the challenges of using VLMs in open-world settings and suggests directions for future research to improve their zero-shot capabilities.
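To make the notion of a similarity score between embeddings concrete, the following is a minimal sketch of how a contrastive VLM is typically used for zero-shot recognition: each candidate description is scored by its cosine similarity to the image embedding, and the highest-scoring description is selected. The embeddings here are hand-made toy vectors, not outputs of CLIP or any real model, and the function names `cosine_scores` and `zero_shot_predict` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb


def zero_shot_predict(image_emb, text_embs, labels):
    """Return the label whose embedding is closest to the image, plus all scores."""
    scores = cosine_scores(image_emb, text_embs)
    return labels[int(np.argmax(scores))], scores


# Toy 3-d vectors standing in for encoder outputs (assumed values, not real data).
labels = [
    "a photo of a dog",
    "a photo of a golden retriever",
    "a photo of a cat",
]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.0, 0.4],
    [0.0, 1.0, 0.0],
])
image_emb = np.array([0.9, 0.1, 0.3])

best, scores = zero_shot_predict(image_emb, text_embs, labels)
for label, score in zip(labels, scores):
    print(f"{score:.3f}  {label}")
print("predicted:", best)
```

In this toy setup the two dog-related descriptions receive close scores while the unrelated one scores far lower. The score measures closeness in embedding space, so it ranks candidates but does not by itself say whether any candidate is correct.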