Vision-language models, such as Contrastive Language-Image Pre-training (CLIP), have demonstrated impressive results in natural image domains. However, these models often struggle when applied to specialized domains like remote sensing, and adapting them to such domains is challenging because few image-text pairs are available for training. To address this, we propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images. S-CLIP employs two pseudo-labeling strategies specifically designed for contrastive learning and the language modality. The caption-level pseudo-label is a combination of the captions of paired images, obtained by solving an optimal transport problem between unpaired and paired images. The keyword-level pseudo-label is a keyword in the caption of the nearest paired image, trained through partial label learning, which assumes a candidate set of labels for supervision instead of the exact one. By combining these objectives, S-CLIP significantly enhances the training of CLIP using only a few image-text pairs, as demonstrated in various specialist domains, including remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark, matching the performance of supervised CLIP while using one third as many image-text pairs.
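
The two pseudo-labeling ideas described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The cosine-distance cost, the uniform marginals, the entropic (Sinkhorn) solver, the values of `eps` and `n_iters`, and the specific partial-label loss are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport plan with uniform marginals (assumed).

    Rows index unpaired images, columns index paired images.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # mass on each unpaired image
    b = np.full(m, 1.0 / m)          # mass on each paired image
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):         # alternate row / column scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def caption_pseudo_labels(unpaired_emb, paired_emb, eps=0.1):
    """Soft distribution over paired captions for each unpaired image.

    Embeddings are assumed to be L2-normalised, so 1 - dot product is
    the cosine distance used as the transport cost (an assumption).
    """
    cost = 1.0 - unpaired_emb @ paired_emb.T
    plan = sinkhorn(cost, eps=eps)
    # Normalise each row so it is a probability vector over captions.
    return plan / plan.sum(axis=1, keepdims=True)


def partial_label_loss(logits, candidate_mask):
    """One common partial-label objective (assumed, not necessarily the
    paper's): negative log of the total probability assigned to the
    candidate keywords, averaged over images.

    logits:         (n_images, n_keywords) image-keyword similarities
    candidate_mask: (n_images, n_keywords) 1 where a keyword is a candidate
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    cand_prob = (probs * candidate_mask).sum(axis=1)
    return float(-np.log(cand_prob + 1e-12).mean())
```

Each row of the array returned by `caption_pseudo_labels` can serve as a soft target over the captions in a batch, replacing the one-hot target that CLIP's contrastive loss uses for paired images. In `partial_label_loss`, the candidate mask would mark the keywords found in the caption of the nearest paired image, so the model is rewarded for placing probability on any of them without being told which one is correct.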