Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, that is, classifying images using novel text labels. Existing work has attempted to enhance CLIP by fine-tuning on downstream tasks, but fine-tuning has inadvertently degraded performance on unseen classes, harming zero-shot generalization. This paper addresses this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we use CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. We then assign a higher weight to the more reliable of the two modalities, the original query image or the retrieved text, when forming the final prediction. X-MoRe demonstrates robust performance across a diverse set of tasks without additional training, showing the effectiveness of cross-modal features for maximizing CLIP's zero-shot ability.
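The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how captions are scored or how modal confidence is measured, so the choices here are assumptions. Captions are ranked by cosine similarity to the query image embedding, confidence is taken as one minus normalized entropy, and the function names, `k`, and `temperature` are all hypothetical. Embeddings are assumed to be precomputed by CLIP's image and text encoders and passed in as plain lists of floats.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def retrieve_captions(image_emb, caption_embs, k):
    """Step 1 (assumed form): indices of the k captions closest to the image."""
    q = normalize(image_emb)
    sims = [dot(q, normalize(c)) for c in caption_embs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]


def confidence(probs):
    """Assumed confidence score: 1 - normalized entropy. Needs >= 2 classes."""
    ent = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, 1.0 - ent / math.log(len(probs)))


def predict(image_emb, class_embs, caption_embs, k=2, temperature=0.01):
    """Step 2 (assumed form): confidence-weighted ensemble of two modalities."""
    classes = [normalize(c) for c in class_embs]
    q = normalize(image_emb)

    # Average the retrieved caption embeddings into one text-side query.
    top = retrieve_captions(image_emb, caption_embs, k)
    picked = [normalize(caption_embs[i]) for i in top]
    text_emb = normalize([sum(col) / len(picked) for col in zip(*picked)])

    # Class probabilities from each modality separately.
    p_img = softmax([dot(q, c) / temperature for c in classes])
    p_txt = softmax([dot(text_emb, c) / temperature for c in classes])

    # The more confident modality receives the larger weight.
    c_img, c_txt = confidence(p_img), confidence(p_txt)
    total = c_img + c_txt
    w_img = c_img / total if total > 0 else 0.5
    probs = [w_img * a + (1.0 - w_img) * b for a, b in zip(p_img, p_txt)]
    return probs, w_img
```

With toy three-dimensional embeddings, `predict([1.0, 0.2, 0.0], class_embs, caption_embs)` returns a probability list over the classes together with the weight given to the image modality. When the image prediction is ambiguous and the retrieved captions are decisive, the weight shifts toward the text side, which is the behavior the abstract describes.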