Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old ones, leveraging the class concepts learned from labelled samples. Current GCD methods rely on visual information alone, which leads to poor classification of visually similar classes. Although certain classes are easily confused visually, their text information may be distinct, which motivates us to introduce text information into the GCD task. However, the lack of class names for unlabelled data makes it impractical to use text information. To tackle this problem, we propose a Text Embedding Synthesizer (TES) that generates pseudo text embeddings for unlabelled samples. Specifically, TES exploits the fact that CLIP produces aligned vision-language features: it converts visual embeddings into tokens for CLIP's text encoder, which then produces the pseudo text embeddings. In addition, we employ a dual-branch framework in which joint learning and instance consistency across the two modality branches allow visual and semantic information to enhance each other, promoting interaction and fusion between the visual and text embedding spaces. Our method unlocks the multi-modal potential of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving a new state of the art. The code will be released at \url{https://github.com/enguangW/GET}.
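The data flow described for TES (visual embedding, then pseudo tokens, then a text encoder, then a pseudo text embedding) can be illustrated with a toy sketch. This is not the authors' implementation: the linear projection, the mean-pooling stand-in for CLIP's frozen text encoder, the dimensions, and all function names below are assumptions made purely for illustration. The cosine similarity at the end only hints at the kind of alignment signal such a module could be trained with.

```python
import math
import random


def make_projection(in_dim, num_tokens, token_dim, seed=0):
    """Hypothetical linear map: one visual embedding -> num_tokens pseudo tokens."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[[rng.gauss(0.0, scale) for _ in range(in_dim)]
             for _ in range(token_dim)]
            for _ in range(num_tokens)]


def to_pseudo_tokens(visual_emb, weights):
    """Project a visual embedding into a sequence of pseudo token vectors."""
    return [[sum(w * v for w, v in zip(row, visual_emb)) for row in token_w]
            for token_w in weights]


def toy_text_encoder(tokens):
    """Stand-in for a frozen text encoder (assumption): mean-pool, then L2-normalize."""
    dim = len(tokens[0])
    pooled = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)


# Toy sizes (assumptions): 8-dim embeddings, 4 pseudo tokens per image.
rng = random.Random(42)
visual_emb = [rng.gauss(0.0, 1.0) for _ in range(8)]

weights = make_projection(in_dim=8, num_tokens=4, token_dim=8)
tokens = to_pseudo_tokens(visual_emb, weights)
pseudo_text_emb = toy_text_encoder(tokens)

# An alignment objective could push this similarity up during training.
similarity = cosine(pseudo_text_emb, visual_emb)
print(len(tokens), len(tokens[0]), round(similarity, 3))
```

In this sketch the token dimension equals the visual embedding dimension so that the cosine similarity is defined. In a real CLIP model the image and text embeddings share a joint space, but the text encoder's input tokens generally have their own width.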