Vision-language models (VLMs) excel at zero-shot recognition, but their performance varies greatly across visual concepts. For example, although CLIP achieves impressive accuracy on ImageNet (60-80%), its accuracy drops below 10% for more than ten concepts, such as night snake, presumably because these concepts are rare in the pretraining data. However, measuring concept frequency in the large-scale datasets used to pretrain VLMs is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets, such as LAION, exhibit a long-tailed concept distribution, which yields biased performance in VLMs. We also find that downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and text-to-image models (e.g., Stable Diffusion), often fail to recognize or generate images of the rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented Learning (REAL). First, instead of prompting VLMs with the original class names, REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts across nine benchmark datasets. Second, REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA while using 400x less storage and 10,000x less training time!
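The frequency-counting and synonym-selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the synonym list for a concept is already available (the abstract obtains synonyms with an LLM), it uses plain whole-word matching on captions, and the captions, the synonym list, the function names, and the prompt template are all hypothetical.

```python
import re
from collections import Counter


def _word_pattern(term):
    # Whole-word, case-insensitive match, so "ATM" does not match "atmosphere".
    return re.compile(r"\b" + re.escape(term.lower()) + r"\b")


def synonym_counts(captions, synonyms):
    """Count, for each synonym, how many captions mention it."""
    patterns = {s: _word_pattern(s) for s in synonyms}
    counts = Counter({s: 0 for s in synonyms})
    for caption in captions:
        text = caption.lower()
        for s, pattern in patterns.items():
            if pattern.search(text):
                counts[s] += 1
    return counts


def concept_frequency(captions, synonyms):
    """Count captions that mention at least one synonym of the concept."""
    patterns = [_word_pattern(s) for s in synonyms]
    return sum(
        any(p.search(caption.lower()) for p in patterns) for caption in captions
    )


def most_frequent_synonym(captions, synonyms):
    """Pick the synonym with the most matching captions (ties: first listed)."""
    counts = synonym_counts(captions, synonyms)
    return max(synonyms, key=lambda s: counts[s])


# Hypothetical toy data, not taken from LAION or from the paper.
TOY_CAPTIONS = [
    "An ATM outside a bank",
    "Person using an ATM at night",
    "Old cash machine in a mall",
    "A tiger resting in the grass",
    "ATM sign on a wall",
    "Foggy atmosphere over the city",
]
TOY_SYNONYMS = ["cash machine", "ATM"]

if __name__ == "__main__":
    name = most_frequent_synonym(TOY_CAPTIONS, TOY_SYNONYMS)
    print("concept frequency:", concept_frequency(TOY_CAPTIONS, TOY_SYNONYMS))
    # Assumed CLIP-style prompt template; the abstract does not specify one.
    print("prompt:", f"a photo of a {name}.")
```

Concept frequency counts each caption once even if it mentions several synonyms, so a concept is not inflated by captions that list multiple names for it. At the scale of real pretraining data, simple string matching of this kind would need further filtering for ambiguous terms, which this sketch does not attempt.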