Generalized Category Discovery (GCD) aims to identify both known and unknown categories when only partial labels are given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches to the GCD task are usually built on multi-modal representation learning, which depends heavily on inter-modality alignment. However, few of them perform a proper intra-modality alignment to induce a desirable underlying structure in the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, which learns cross-modality representations with the desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modality alignment offered by Vision-Language Models. Extensive experiments on generic and fine-grained benchmark datasets demonstrate the superior performance of our approach.