Novel Class Discovery (NCD) is the problem of extracting knowledge from a labeled set of known classes in order to accurately partition an unlabeled set of novel classes. Although NCD has recently received considerable attention, it is mostly studied on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and the labels of the novel classes are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process so that some of the known classes are hidden in each fold. Because we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the elements essential to the NCD problem and performs impressively well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments on 7 tabular datasets demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
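The idea of hiding known classes during cross-validation can be illustrated with a minimal sketch. The abstract does not specify how the classes are assigned to folds, so the code below assumes that the known classes are randomly partitioned into disjoint groups, with one group hidden per fold; the function name `hidden_class_folds` and its parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def hidden_class_folds(y_known, n_folds=3, seed=0):
    """Yield (labeled_idx, hidden_idx, hidden_classes) for each fold.

    Assumption (not from the paper): the known classes are shuffled and
    split into n_folds disjoint groups. In each fold, one group plays the
    role of the novel classes: its samples are returned in hidden_idx, and
    their labels should be used only to score the resulting clustering,
    never to train the model.
    """
    y_known = np.asarray(y_known)
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(y_known))
    if n_folds > len(classes):
        raise ValueError("n_folds cannot exceed the number of known classes")
    for hidden_classes in np.array_split(classes, n_folds):
        hidden_mask = np.isin(y_known, hidden_classes)
        labeled_idx = np.flatnonzero(~hidden_mask)
        hidden_idx = np.flatnonzero(hidden_mask)
        yield labeled_idx, hidden_idx, hidden_classes

# Toy example: 6 known classes with 2 samples each, split into 3 folds.
y = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
folds = list(hidden_class_folds(y, n_folds=3))
```

Under this sketch, a hyperparameter configuration would be scored by training on the labeled indices of each fold, clustering the hidden indices, comparing the clusters with the held-back labels, and averaging the score over folds. No label of a truly novel class is needed at any point.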