Vision foundation models have transformed visual recognition through powerful pretrained representations and strong zero-shot capabilities, yet their potential for data-efficient learning remains largely untapped. Active learning (AL) aims to minimize annotation cost by selecting the most informative samples for labeling, but existing methods largely overlook the rich multimodal knowledge embedded in modern vision-language models (VLMs). We introduce Conformal Cross-Modal Acquisition (CCMA), a novel AL framework that bridges the vision and language modalities through a teacher-student architecture. CCMA uses a pretrained VLM as a teacher to provide semantically grounded, conformally calibrated uncertainty estimates that guide sample selection for a vision-only student model. By combining multimodal conformal scoring with diversity-aware selection strategies, CCMA improves data efficiency across multiple benchmarks. Our approach consistently outperforms state-of-the-art AL baselines, with clear advantages over methods that rely solely on uncertainty or diversity metrics.
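To make the idea of conformally calibrated uncertainty combined with diversity-aware selection more concrete, the following is a minimal sketch and not the CCMA implementation. It assumes the teacher's class probabilities are already available as arrays, uses standard split conformal prediction with the nonconformity score 1 - p(true class), treats prediction-set size as the uncertainty signal, and swaps in greedy farthest-point selection as a generic diversity step. All function names and parameters (`conformal_threshold`, `prediction_set_sizes`, `select_batch`, `pool_factor`) are hypothetical.

```python
import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold on the score 1 - p(true class).

    cal_probs: (n, num_classes) teacher probabilities on a labeled calibration set.
    cal_labels: (n,) integer class labels.
    """
    n = len(cal_labels)
    scores = np.sort(1.0 - cal_probs[np.arange(n), cal_labels])
    # Finite-sample corrected rank, clipped so it stays a valid index.
    rank = min(int(np.ceil((n + 1) * (1.0 - alpha))), n)
    return scores[rank - 1]


def prediction_set_sizes(probs, qhat):
    """Count the classes whose score 1 - p is within the threshold."""
    return (1.0 - probs <= qhat).sum(axis=1)


def select_batch(probs, features, qhat, budget, pool_factor=3):
    """Shortlist uncertain samples, then pick a diverse subset of them.

    Assumption: a larger conformal prediction set means a less certain teacher.
    """
    sizes = prediction_set_sizes(probs, qhat)
    # Primary key: larger sets first. Secondary key: lower top probability first.
    order = np.lexsort((probs.max(axis=1), -sizes))
    shortlist = order[: min(len(order), budget * pool_factor)]
    pool = features[shortlist]

    # Greedy farthest-point selection inside the shortlist.
    chosen = [0]
    dists = np.linalg.norm(pool - pool[0], axis=1)
    dists[0] = -np.inf  # never pick the same sample twice
    while len(chosen) < min(budget, len(shortlist)):
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(pool - pool[nxt], axis=1))
        dists[nxt] = -np.inf
    return shortlist[np.array(chosen)]
```

In an active learning loop under these assumptions, `select_batch` would be called once per round on the unlabeled pool, the returned indices sent for annotation, and the vision-only student retrained on the enlarged labeled set.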