A Practical Approach to Novel Class Discovery in Tabular Data

The problem of Novel Class Discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received considerable attention, it is most often studied on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the $k$-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the elements essential to the NCD problem and performs well under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms ($k$-means and Spectral Clustering) to leverage the knowledge of the known classes. Experiments on 7 tabular datasets demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
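The idea of hiding some of the known classes in each fold can be illustrated with a minimal sketch. It is not the procedure from the paper: it assumes that the known classes are partitioned into `n_folds` groups and that each fold hides one group, and the function name `hidden_class_folds` and the returned dictionary keys are hypothetical. In each fold, an NCD method would be fitted with the hidden samples treated as unlabeled, and their labels would be used only to score the resulting clustering.

```python
import random
from typing import Dict, Hashable, List, Sequence


def hidden_class_folds(
    labels: Sequence[Hashable], n_folds: int = 3, seed: int = 0
) -> List[Dict[str, list]]:
    """Partition the known classes into n_folds groups; each fold hides one group.

    Assumption (not from the paper): every known class is hidden exactly once.
    """
    classes = sorted(set(labels))
    if not 2 <= n_folds <= len(classes):
        raise ValueError("n_folds must be between 2 and the number of known classes")

    rng = random.Random(seed)
    rng.shuffle(classes)
    # Round-robin assignment so that every class lands in exactly one group.
    groups = [classes[i::n_folds] for i in range(n_folds)]

    folds = []
    for hidden in groups:
        hidden_set = set(hidden)
        folds.append({
            "hidden_classes": sorted(hidden),
            # Samples whose labels stay visible while fitting the NCD method.
            "labeled_idx": [i for i, y in enumerate(labels) if y not in hidden_set],
            # Samples treated as novel: labels are used only to score the clustering.
            "hidden_idx": [i for i, y in enumerate(labels) if y in hidden_set],
        })
    return folds


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
    for fold in hidden_class_folds(toy_labels, n_folds=3):
        print(fold["hidden_classes"], len(fold["labeled_idx"]), len(fold["hidden_idx"]))
```

A hyperparameter configuration would then be ranked by its average clustering score on the hidden samples across folds, so that no label from the truly novel classes is ever consulted.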
