Pre-trained multi-modal vision-language models (VLMs) are becoming increasingly popular due to their exceptional performance on downstream vision applications, particularly in few- and zero-shot settings. However, selecting the best-performing VLM for a given downstream application is non-trivial, as the best choice is dataset- and task-dependent. Meanwhile, exhaustively evaluating all available VLMs on a novel application is not only time- and compute-intensive but also requires collecting a labeled dataset for evaluation. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. This paper proposes a novel task and benchmark for efficiently evaluating VLMs' zero-shot performance on downstream applications without access to the downstream task dataset. Specifically, we introduce a new task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We then introduce an extensive LOVM benchmark consisting of ground-truth evaluations of 35 pre-trained VLMs on 23 datasets, where methods are expected to rank the pre-trained VLMs and predict their zero-shot performance.
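To make the two outputs of the task concrete, the following is a minimal sketch of how a method's model ranking and performance predictions might be scored against ground-truth evaluations on a single dataset. The metrics shown (top-k overlap and mean absolute error), the function names, the model names, and all numbers are illustrative assumptions; the text above does not specify which metrics the benchmark uses.

```python
from typing import Dict, List


def rank_models(scores: Dict[str, float]) -> List[str]:
    """Return model names sorted from highest to lowest score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def top_k_overlap(predicted: Dict[str, float], actual: Dict[str, float], k: int) -> float:
    """Fraction of the true top-k models that also appear in the predicted top-k."""
    predicted_top = set(rank_models(predicted)[:k])
    actual_top = set(rank_models(actual)[:k])
    return len(predicted_top & actual_top) / k


def mean_absolute_error(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Average absolute gap between predicted and ground-truth accuracy."""
    return sum(abs(predicted[m] - actual[m]) for m in actual) / len(actual)


# Hypothetical zero-shot accuracies for four placeholder models on one dataset.
# "predicted" stands in for the output of a text-only selection method;
# "actual" stands in for the ground-truth evaluation held by the benchmark.
predicted = {"model_a": 0.70, "model_b": 0.60, "model_c": 0.55, "model_d": 0.40}
actual = {"model_a": 0.68, "model_b": 0.52, "model_c": 0.58, "model_d": 0.45}

overlap = top_k_overlap(predicted, actual, k=2)
mae = mean_absolute_error(predicted, actual)
print(f"top-2 overlap: {overlap:.2f}, MAE: {mae:.3f}")
```

The ranking score captures model selection and the error score captures performance prediction, so a method can be assessed on each output separately.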