We tackle the problem of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset in which the unlabelled data may contain instances from both novel categories and labelled classes. In this paper, we address GCD when the number of categories in the unlabelled data is unknown. We propose a framework, named CiPR, that bootstraps the representation by exploiting Cross-instance Positive Relations in the partially labelled data for contrastive learning, relations that existing methods have neglected. To obtain reliable cross-instance relations for representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which produces a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We further present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indices of both labelled and unlabelled data, and we extend SNC to assign labels to the unlabelled instances given a class number. We thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, and establish a new state of the art. Code: https://github.com/haoosz/CiPR
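To make the idea of clustering from the connected components of a neighbor graph more concrete, the following is a minimal sketch of a single clustering level. It is not the authors' SNC implementation: the function names `selective_neighbors` and `connected_components`, the convention that `-1` marks an unlabelled instance, and the neighbor selection rule itself (a labelled point links to its nearest same-class labelled point, an unlabelled point links to its nearest point overall) are all assumptions made for illustration. The actual rule and the way the hierarchy is built are defined in the paper and the linked repository.

```python
import numpy as np


def selective_neighbors(feats, labels):
    """Pick one neighbor index per point.

    labels[i] >= 0 marks a labelled point, -1 an unlabelled one.
    Assumed rule (illustrative only): a labelled point links to its nearest
    labelled point of the same class; an unlabelled point links to its
    nearest point overall.
    """
    n = len(feats)
    # Pairwise Euclidean distances; a point is never its own nearest neighbor.
    dist = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    neighbors = np.empty(n, dtype=int)
    for i in range(n):
        row = dist[i].copy()
        if labels[i] >= 0:
            same = labels == labels[i]
            same[i] = False
            if not same.any():
                # No same-class peer: the point links to itself.
                neighbors[i] = i
                continue
            row[~same] = np.inf  # restrict candidates to same-class points
        neighbors[i] = int(np.argmin(row))
    return neighbors


def connected_components(neighbors):
    """Group points joined by neighbor links, using union-find."""
    parent = list(range(len(neighbors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(neighbors):
        ri, rj = find(i), find(int(j))
        if ri != rj:
            parent[rj] = ri

    roots = [find(i) for i in range(len(neighbors))]
    # Relabel arbitrary root indices as consecutive cluster ids 0..k-1.
    _, cluster_ids = np.unique(roots, return_inverse=True)
    return cluster_ids


# Toy data: two well-separated groups, partially labelled.
feats = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
labels = np.array([0, 0, -1, 1, -1, -1])

neighbors = selective_neighbors(feats, labels)
cluster_ids = connected_components(neighbors)
```

In this toy example each connected component becomes one cluster. Coarser levels of a hierarchy could presumably be obtained by repeating the step on cluster-level features, but that step is left out here because its details are specific to the method.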