A Closer Look at Novel Class Discovery from the Labeled Set

Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge from a labeled set comprising disjoint but related classes. Existing research focuses primarily on how to use the labeled set at the methodological level, with less emphasis on analyzing the labeled set itself. In this paper, we therefore rethink novel class discovery from the perspective of the labeled set and focus on two core questions: (i) Given a specific unlabeled set, what kind of labeled set best supports novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD benefits more from a labeled set with a high degree of semantic similarity to the unlabeled set. Specifically, we establish a large-scale benchmark on ImageNet with varying degrees of semantic similarity between the labeled and unlabeled sets by leveraging its hierarchical class structure. In sharp contrast, existing NCD benchmarks are built from labeled sets that differ in the number of categories and images, and they completely ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. We use this metric to confirm the validity of our proposed benchmark and demonstrate that it correlates strongly with NCD performance. Furthermore, previous works commonly assume, without quantitative analysis, that label information is always beneficial. Counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings.
