Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples. Previous studies argued that parametric classifiers are prone to overfitting to seen categories, and endorsed using a non-parametric classifier formed with semi-supervised k-means. However, in this study, we investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem. We demonstrate that two prediction biases exist: the classifier tends to predict seen classes more often, and produces an imbalanced distribution across seen and novel categories. Based on these findings, we propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers. We hope the investigation and proposed simple framework can serve as a strong baseline to facilitate future studies in this field. Our code is available at: https://github.com/CVMI-Lab/SimGCD.
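The abstract names entropy regularisation as the remedy for the two prediction biases but does not spell out its form. The following is a minimal sketch, assuming one common variant: the entropy of the batch-averaged prediction is subtracted from the training loss, so that minimising the loss pushes predictions to spread across all categories instead of collapsing onto seen ones. The function names, the `weight` parameter and its default, and the toy logits are hypothetical and are not taken from the released code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_prediction_entropy(batch_logits, temperature=1.0):
    """Entropy of the class distribution averaged over a batch.

    High when predictions are spread evenly across classes,
    low when most samples are assigned to the same few classes.
    """
    probs = [softmax(row, temperature) for row in batch_logits]
    n_samples = len(probs)
    n_classes = len(probs[0])
    mean_p = [sum(p[j] for p in probs) / n_samples for j in range(n_classes)]
    return -sum(p * math.log(p) for p in mean_p if p > 0.0)


def regularised_loss(base_loss, batch_logits, weight=1.0):
    """Subtract the weighted mean-prediction entropy from a base loss.

    `weight` is a placeholder value, not a setting from the paper.
    """
    return base_loss - weight * mean_prediction_entropy(batch_logits)


# Toy batches over three classes (illustrative values only).
balanced = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
biased = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]

print(mean_prediction_entropy(balanced))
print(mean_prediction_entropy(biased))
```

Under this assumption, a batch whose predictions all fall on one class receives a higher regularised loss than a batch whose predictions are balanced, which is the direction of pressure the abstract describes.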