We tackle the problem of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data contain instances from both novel categories and the labelled classes. In this paper, we address the GCD problem without assuming that the number of categories in the unlabelled data is known. We propose a framework, named CiPR, that bootstraps the representation by exploiting Cross-instance Positive Relations in the partially labelled data for contrastive learning; these relations are neglected by existing methods. First, to obtain reliable cross-instance relations that facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which produces a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We also extend SNC to assign labels to the unlabelled instances when the class number is given. Moreover, we present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data. Finally, we thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, establishing a new state of the art on all of them.
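To make the idea of clustering by connected components of a neighbor graph more concrete, the following is a minimal sketch of a single partition level. It is not the SNC algorithm itself: the abstract does not specify the selective neighbor rule, so the rule used here (a labelled instance links to its most similar same-class labelled instance, an unlabelled instance links to its nearest neighbor overall) is an assumption made for illustration, as are the function names `find` and `one_level_partition` and the convention that `-1` marks an unlabelled instance.

```python
import numpy as np


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def one_level_partition(features, labels):
    """Cluster instances via connected components of a neighbor graph.

    features: (n, d) float array of instance embeddings.
    labels:   (n,) int array, class id for labelled instances, -1 otherwise.
    Returns an (n,) array of cluster ids for one level of the partition.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    n = len(features)

    # Cosine similarity between all pairs; exclude self-matches.
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)

    parent = list(range(n))
    for i in range(n):
        if labels[i] >= 0:
            # Assumed rule: a labelled instance only links within its class.
            same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
            if same.size == 0:
                continue  # sole labelled member of its class: no outgoing edge
            j = int(same[np.argmax(sim[i, same])])
        else:
            # Assumed rule: an unlabelled instance links to its nearest neighbor.
            j = int(np.argmax(sim[i]))
        # Adding edge (i, j) merges the two components.
        parent[find(parent, i)] = find(parent, j)

    roots = np.array([find(parent, i) for i in range(n)])
    _, cluster_ids = np.unique(roots, return_inverse=True)
    return cluster_ids
```

Under these assumptions, a hierarchy could be obtained by repeating the same step on per-cluster mean features until a single cluster (or a target number of clusters) remains; how the actual method builds and uses its hierarchy is described in the paper body, not in this sketch.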