In this paper, we study the problem of Generalized Category Discovery (GCD), which aims to cluster unlabeled data from both known and unknown categories using the knowledge of labeled data from known categories. Current GCD methods rely solely on visual cues, neglecting the multi-modal perceptual nature of human cognition in discovering novel visual categories. To address this, we propose a two-phase TextGCD framework that accomplishes multi-modal GCD by exploiting powerful Vision-Language Models. TextGCD consists of a retrieval-based text generation (RTG) phase and a cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon from category tags in diverse datasets and attributes produced by Large Language Models, and generates descriptive texts for images in a retrieval manner. Second, CCT leverages the disparities between the textual and visual modalities to foster mutual learning, thereby enhancing visual GCD. In addition, we design an adaptive class-aligning strategy to ensure that category perceptions are aligned across modalities, as well as a soft-voting mechanism to integrate multi-modal cues. Experiments on eight datasets demonstrate the clear superiority of our approach over state-of-the-art methods. Notably, our approach outperforms the best competitor by 7.7% and 10.8% in All accuracy on ImageNet-1k and CUB, respectively.
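For intuition, the sketch below illustrates the retrieval idea behind RTG (ranking lexicon entries by image-text similarity and concatenating the top ones into a descriptive text) and the soft-voting fusion of the two modalities. It is a minimal sketch under assumed interfaces: the encoders, the lexicon contents, the description template, and the fusion weight are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of retrieval-based text generation (RTG) and soft-voting fusion,
# assuming a CLIP-like model that embeds images and texts into a shared space.
# `encode_image` / `encode_text` are hypothetical stand-ins for such encoders.
import numpy as np

def encode_image(image_id: str) -> np.ndarray:
    """Hypothetical image encoder (e.g., a CLIP visual tower); returns a unit vector."""
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def encode_text(text: str) -> np.ndarray:
    """Hypothetical text encoder (e.g., a CLIP text tower); returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def generate_description(image_id: str, tags: list[str], attributes: list[str],
                         k_tags: int = 3, k_attrs: int = 5) -> str:
    """Retrieve the lexicon entries most similar to the image and join them into a text."""
    img = encode_image(image_id)

    def top_k(candidates: list[str], k: int) -> list[str]:
        # Rank candidates by cosine similarity to the image embedding.
        scored = sorted(candidates, key=lambda c: float(img @ encode_text(c)), reverse=True)
        return scored[:k]

    chosen_tags = top_k(tags, k_tags)
    chosen_attrs = top_k(attributes, k_attrs)
    # Illustrative template; the actual prompt format is an assumption.
    return "a photo of " + ", ".join(chosen_tags) + ", which has " + ", ".join(chosen_attrs)

def soft_vote(p_text: np.ndarray, p_visual: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Fuse per-class probabilities from the text and visual branches by weighted averaging."""
    return w * p_text + (1.0 - w) * p_visual

# Usage example with a toy lexicon (tags and attributes are placeholders).
desc = generate_description(
    "img_001",
    tags=["sparrow", "warbler", "finch"],
    attributes=["brown wings", "short beak", "streaked chest", "small body", "notched tail"],
)
print(desc)
```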