Recent work has sought to improve the zero-shot generalization of vision-language models (e.g., CLIP) by addressing the semantic misalignment between image and text embeddings in downstream tasks. However, existing methods rarely account for the fact that a single class of images can be described by notably different textual concepts, owing to the well-known lexical variation of natural language, and this variation heavily affects the zero-shot generalization of CLIP. This paper therefore proposes a \textbf{S}ynonymous \textbf{S}emantic \textbf{S}pace ($S^3$) for each image class, rather than relying on a single textual concept, to achieve more stable semantic alignment and improve the zero-shot generalization of CLIP. Specifically, $S^3$ first uses large language models to generate several synonymous concepts from the label of each class, and then constructs a continuous yet compact synonymous semantic space from the Vietoris-Rips complex of the generated concepts. We further explore the effect of several point-to-space metrics on $S^3$ and present a point-to-local-center metric that computes the similarity between an image embedding and the synonymous semantic space of each class, enabling effective zero-shot prediction. Extensive experiments across 17 benchmarks, covering fine-grained zero-shot classification, natural distribution zero-shot classification, and open-vocabulary segmentation, show that $S^3$ outperforms state-of-the-art methods.
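To make the pipeline described above more concrete, the following is a minimal sketch of how a per-class space could be built from a Vietoris-Rips complex and scored with a point-to-local-center style metric. It is an illustration under stated assumptions, not the paper's implementation: the function names (`rips_simplices`, `local_centers`, `classify`), the threshold `epsilon`, the choice of simplex centroids as "local centers", and the max-cosine scoring rule are all hypothetical. The toy 3-dimensional vectors stand in for CLIP text and image embeddings, and the LLM step that generates synonyms is omitted.

```python
import itertools
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def rips_simplices(points, epsilon, max_dim=2):
    """Simplices of the Vietoris-Rips complex up to dimension max_dim.

    A subset of points forms a simplex when all of its pairwise
    Euclidean distances are at most epsilon.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    simplices = [(i,) for i in range(n)]  # every vertex is a 0-simplex
    for k in range(2, max_dim + 2):       # k vertices form a (k-1)-simplex
        for combo in itertools.combinations(range(n), k):
            if all(dist[i, j] <= epsilon
                   for i, j in itertools.combinations(combo, 2)):
                simplices.append(combo)
    return simplices

def local_centers(points, simplices):
    """ASSUMPTION: take the normalized centroid of each simplex as a local center."""
    centers = np.stack([points[list(s)].mean(axis=0) for s in simplices])
    return l2_normalize(centers)

def classify(image_emb, class_synonym_embs, epsilon=0.8):
    """ASSUMPTION: score each class by the best cosine similarity to a local center."""
    image_emb = l2_normalize(image_emb)
    scores = {}
    for name, embs in class_synonym_embs.items():
        pts = l2_normalize(embs)
        centers = local_centers(pts, rips_simplices(pts, epsilon))
        scores[name] = float(np.max(centers @ image_emb))
    return max(scores, key=scores.get), scores

# Toy stand-ins for text embeddings of synonymous concepts (hypothetical values).
classes = {
    "cat": np.array([[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.1, 1.0]]),
    "dog": np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.2]]),
}
image = np.array([0.95, 0.2, 0.0])  # toy stand-in for an image embedding
label, scores = classify(image, classes)
```

In this toy setup the image embedding lies between two nearby "cat" synonyms, so the centroid of the edge joining them matches it more closely than either synonym alone. That is the intuition behind scoring against a space rather than a single textual concept. Enumerating simplices this way is exponential in the number of synonyms, so the sketch is only practical for the small synonym sets assumed here.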