In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data while maintaining performance on seen categories from labeled data. The central challenge is the substantial learning gap between seen and novel categories: the model learns the former faster because they come with accurate supervisory information. To address this, we introduce 1) an adaptive margin loss based on the estimated class distribution, which encourages a large negative margin for samples in seen classes in order to synchronize learning paces, and 2) pseudo-label contrastive clustering, which pulls together samples that are likely from the same class in the output space in order to enhance novel class discovery. Our extensive evaluations on multiple datasets demonstrate that existing models still hinder novel class learning, whereas our approach balances seen and novel classes, achieving a 3% average accuracy increase on the ImageNet dataset over the prior state of the art. Additionally, we find that fine-tuning the self-supervised pre-trained backbone significantly boosts performance over the default setting in prior literature. We will release the code once the paper is accepted.
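To make the role of a negative margin concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes that the per-class margin is proportional to the estimated class frequency (a hypothetical choice) and that the margin is subtracted from the target logit inside a standard cross-entropy. The names `adaptive_margins`, `margin_cross_entropy`, and `scale` are illustrative and do not come from the paper.

```python
import math


def adaptive_margins(est_dist, scale=1.0):
    """Map an estimated class distribution to per-class margins <= 0.

    Assumption (not from the paper): classes that are predicted more often,
    which early in training tend to be the seen classes, receive a larger
    negative margin.
    """
    peak = max(est_dist)
    return [-scale * p / peak for p in est_dist]


def margin_cross_entropy(logits, target, margins):
    """Cross-entropy with the target logit z_y replaced by z_y - margin_y.

    A negative margin raises the target logit, so the loss (and hence the
    gradient magnitude) for that sample shrinks.
    """
    adjusted = list(logits)
    adjusted[target] = logits[target] - margins[target]
    # Numerically stable log-sum-exp over the adjusted logits.
    top = max(adjusted)
    log_norm = top + math.log(sum(math.exp(z - top) for z in adjusted))
    return log_norm - adjusted[target]


# Toy setup: classes 0 and 1 dominate the estimated distribution.
est_dist = [0.4, 0.4, 0.1, 0.1]
margins = adaptive_margins(est_dist, scale=2.0)
logits = [1.0, 0.5, 0.2, 0.1]

plain_loss = margin_cross_entropy(logits, 0, [0.0] * 4)
relaxed_loss = margin_cross_entropy(logits, 0, margins)
```

In this toy setup `relaxed_loss` is smaller than `plain_loss` for the frequently predicted class 0, which illustrates how a large negative margin can slow learning on seen classes relative to novel ones.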