Underrepresented in Foundation Model Pretraining Data? A One-Shot Probe

Large-scale Vision-Language Foundation Models (VLFMs), such as CLIP, now underpin a wide range of computer vision research and applications. VLFMs are often adapted to various domain-specific tasks. However, VLFM performance on novel, specialised, or underrepresented domains remains inconsistent. Evaluating VLFMs typically requires labelled test sets, which are often unavailable for niche domains of interest, particularly those from the Global South. We address this gap by proposing a highly data-efficient method to predict a VLFM's zero-shot accuracy on a target domain using only a single labelled image per class. Our approach uses a Large Language Model to generate plausible counterfactual descriptions of a given image. By measuring the VLFM's ability to distinguish the correct description from these hard negatives, we engineer features that capture the VLFM's discriminative power in its shared embedding space. A linear regressor trained on these similarity scores estimates the VLFM's zero-shot test accuracy across various visual domains with a Pearson correlation coefficient of r = 0.96. We demonstrate our method's performance across five diverse datasets, including standard benchmark datasets and underrepresented datasets from Africa. Our work provides a low-cost, reliable tool for probing VLFMs, enabling researchers and practitioners to make informed decisions about data annotation efforts before committing significant resources. The model training code, generated captions, and counterfactuals are available at https://github.com/chris-vorster/PreLabellingProbe.
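
The probing idea described above can be illustrated with a minimal sketch. It is not the released implementation: the three features (similarity to the correct description, margin over the hardest counterfactual, and softmax probability of the correct description), the temperature value, and the function names `probe_features`, `fit_linear`, and `predict_linear` are all assumptions made for illustration. Embeddings are assumed to come from a VLFM such as CLIP and are passed in as plain NumPy arrays.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between rows of a (m, d) and rows of b (n, d)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def probe_features(image_emb, true_text_emb, counterfactual_embs, temperature=0.01):
    """Hypothetical per-image features from image/text similarities.

    image_emb: (d,) image embedding from the VLFM.
    true_text_emb: (d,) embedding of the correct description.
    counterfactual_embs: (k, d) embeddings of LLM-generated hard negatives.
    The feature choice and temperature are assumptions, not the paper's.
    """
    texts = np.vstack([true_text_emb, counterfactual_embs])  # row 0 is the correct text
    sims = cosine(image_emb[None, :], texts)[0]              # shape (k + 1,)
    logits = sims / temperature
    logits = logits - logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    margin = sims[0] - sims[1:].max()                        # gap to hardest negative
    return np.array([sims[0], margin, probs[0]])


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w


def predict_linear(w, X):
    """Apply the fitted weights (last entry is the intercept)."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w
```

In such a setup, each row of `X` would hold features aggregated over the one-shot images of a dataset (for example by averaging, which is also an assumption here), and `y` would hold the known zero-shot accuracies of the datasets used to fit the regressor.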
