Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper, we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated by a Large Language Model (LLM) that describe the categories of interest and effectively substitute for labeled visual instances of those categories. Using our label-free approach, we attain significant performance improvements over the zero-shot performance of the base VL model and over other contemporary methods and baselines on a wide variety of datasets, demonstrating an absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.
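For readers unfamiliar with the zero-shot setting that serves as the baseline here, the following is a minimal sketch of prompt-based classification in a shared image-text embedding space. It is not the method proposed in the paper. The embeddings are hand-written toy vectors standing in for the outputs of a VL model's encoders, and the function names `cosine` and `zero_shot_classify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the prompt most similar to the image, plus all scores.

    prompt_embeddings maps each category prompt to its text embedding.
    No labeled images are involved: categories are defined by text alone.
    """
    scores = {
        prompt: cosine(image_embedding, embedding)
        for prompt, embedding in prompt_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy 3-dimensional embeddings (assumed values, for illustration only).
# A real VL model would produce these with its text and image encoders.
prompt_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

Because the categories are defined only by their prompts, extending the classifier to a new category amounts to adding one more entry to `prompt_embeddings`, which is what makes the recognition open-vocabulary.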