Vision-language models such as Contrastive Language-Image Pre-Training (CLIP) have been studied extensively in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, in which CLIP predicts unknown test samples sequentially, in random order, while the feature extractor and model parameters remain fixed throughout inference. Most existing approaches in this setting adapt representations online using the incoming test samples but neglect the distribution of the data on which CLIP was originally trained. This mismatch can degrade performance when the label distribution of the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), a method that formulates online zero-shot classification as a domain adaptation problem. Specifically, LSA adapts the predictions of CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. Extensive experiments across multiple datasets demonstrate that LSA consistently outperforms state-of-the-art CLIP-based online zero-shot learning methods.
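The abstract does not spell out how the label shift correction is computed, so the following is only a minimal sketch of one classical way to do it, not the LSA algorithm itself. It reweights each source-domain class probability by the ratio of an estimated target prior to the source prior, and updates the target prior online as a running mean of the corrected posteriors. The class name `OnlineLabelShiftCorrector`, the uniform default for the unknown source prior, and the running-mean update are all assumptions made for illustration.

```python
import numpy as np


class OnlineLabelShiftCorrector:
    """Hypothetical prior-ratio corrector; an illustration, not the paper's method."""

    def __init__(self, num_classes, source_prior=None):
        uniform = np.full(num_classes, 1.0 / num_classes)
        # Assumption: the source prior is unknown, so default to uniform.
        if source_prior is None:
            self.source_prior = uniform.copy()
        else:
            self.source_prior = np.asarray(source_prior, dtype=float)
        # Start the target prior estimate at uniform with a pseudo-count of 1.
        self.target_prior = uniform.copy()
        self.count = 1

    def predict(self, source_probs):
        """Correct one sample's class probabilities and update the prior estimate."""
        source_probs = np.asarray(source_probs, dtype=float)
        # Reweight p_s(y|x) by p_t(y) / p_s(y), then renormalize.
        weighted = source_probs * self.target_prior / self.source_prior
        posterior = weighted / weighted.sum()
        # Running mean of corrected posteriors as the target prior estimate.
        self.target_prior = (self.count * self.target_prior + posterior) / (self.count + 1)
        self.count += 1
        return posterior


if __name__ == "__main__":
    corrector = OnlineLabelShiftCorrector(num_classes=3)
    # Stand-in for softmax-normalized zero-shot scores of three test samples.
    stream = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
    for probs in stream:
        print(corrector.predict(probs))
```

In this sketch the model outputs are never changed, which matches the fixed-parameter setting described above; only the per-sample probabilities are rescaled as evidence about the test label distribution accumulates. Because each corrected posterior feeds back into the prior estimate, early mistakes can reinforce themselves, so a practical version would likely need some form of damping or regularization.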