Vision and vision-language foundation models are trained on large-scale unlabeled or noisy data and learn robust representations that achieve impressive zero- or few-shot performance on diverse tasks. These properties make them a natural fit for active learning (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, particularly in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL: 1) selecting the initial labeled pool, 2) ensuring diverse sampling, and 3) trading off representative sampling against uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new, simple, and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images, which are relatively understudied in the AL literature. Source code will be made available.
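To make the idea of balancing dropout-based uncertainty with sample diversity concrete, the following is a minimal sketch and not the strategy proposed in this work. It assumes frozen feature vectors from a foundation model, a linear classifier given as a list of per-class weight vectors, predictive entropy averaged over several dropout-perturbed copies of each feature as the uncertainty score, and greedy farthest-point selection as the diversity step. The names `dropout_entropy`, `select_batch`, `passes`, `p`, and `pool_factor` are hypothetical and chosen only for illustration.

```python
import math
import random


def dropout_entropy(x, weights, rng, passes=8, p=0.5):
    """Entropy of the mean softmax over several dropout-perturbed copies of x."""
    n_classes = len(weights)
    mean_probs = [0.0] * n_classes
    for _ in range(passes):
        # Inverted dropout applied to the (assumed frozen) feature vector.
        dropped = [0.0 if rng.random() < p else v / (1.0 - p) for v in x]
        logits = [sum(w_i * v for w_i, v in zip(w, dropped)) for w in weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        for c in range(n_classes):
            mean_probs[c] += exps[c] / (total * passes)
    return -sum(q * math.log(q) for q in mean_probs if q > 0.0)


def select_batch(features, weights, budget, pool_factor=3, seed=0):
    """Pick `budget` indices: most uncertain first, then spread out in feature space."""
    if budget <= 0 or not features:
        return []
    rng = random.Random(seed)
    scores = [dropout_entropy(x, weights, rng) for x in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    # Assumption: keep a candidate pool a few times larger than the budget.
    candidates = ranked[: min(len(ranked), budget * pool_factor)]

    chosen = [candidates[0]]  # start from the most uncertain candidate
    while len(chosen) < min(budget, len(candidates)):
        # Greedy farthest-point step: maximise distance to the nearest chosen sample.
        best = max(
            (i for i in candidates if i not in chosen),
            key=lambda i: min(math.dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen
```

In this sketch, `pool_factor` controls the trade-off: a value of 1 reduces to pure uncertainty sampling, while larger values let the diversity step choose among more candidates.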