Novel Class Discovery (NCD) is the problem of extracting knowledge from a labeled set of known classes in order to accurately partition an unlabeled set of novel classes. Although NCD has recently received considerable attention from the community, it is usually studied on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD on tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process so that some of the known classes are hidden in each fold. Because we found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method consists of only the elements essential to the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. We also adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments on 7 tabular datasets demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
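As an illustration of the idea of hiding known classes during cross-validation, the following is a minimal sketch and not the authors' implementation. It assumes that the known classes are split into disjoint groups and that each fold hides exactly one group; the abstract does not specify how classes are chosen, so the function name `hidden_class_folds`, its parameters, and the splitting rule are all hypothetical.

```python
import numpy as np


def hidden_class_folds(y_known, n_folds=3, seed=0):
    """Yield (visible_idx, hidden_idx, hidden_classes) for each fold.

    Assumption (not from the source): the known classes are shuffled and
    split into n_folds disjoint groups, and each fold hides one group.
    Samples of hidden classes play the role of "novel" data in that fold.
    """
    y_known = np.asarray(y_known)
    classes = np.unique(y_known)
    if not 2 <= n_folds <= len(classes):
        raise ValueError("n_folds must be between 2 and the number of known classes")

    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(classes)

    # Each group of classes is hidden exactly once across the folds.
    for hidden in np.array_split(shuffled, n_folds):
        is_hidden = np.isin(y_known, hidden)
        visible_idx = np.flatnonzero(~is_hidden)  # labels stay available
        hidden_idx = np.flatnonzero(is_hidden)    # labels withheld from training
        yield visible_idx, hidden_idx, hidden
```

In such a setup, a candidate hyperparameter configuration would be trained on the visible samples with their labels plus the hidden samples without labels, and then scored by comparing its clustering of the hidden samples against their withheld labels. Averaging that score over the folds gives a selection criterion that never touches the labels of the truly novel classes.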