Semi-supervised learning (SSL) offers a robust framework for exploiting unannotated data. Traditionally, SSL assumes that every class has labeled instances. Open-world SSL (OwSSL) poses a more practical challenge, in which the unlabeled data may contain samples from unseen classes. In this setting, samples from unseen classes are misclassified as known ones, which undermines classification accuracy. To overcome this challenge, this study revisits two methodologies from self-supervised and semi-supervised learning, self-labeling and consistency, and tailors them to the OwSSL problem. Specifically, we propose an effective framework called OwMatch, which combines conditional self-labeling and open-world hierarchical thresholding. Theoretically, we analyze the estimation of the class distribution on unlabeled data through rigorous statistical analysis, showing that OwMatch reliably ensures the unbiasedness of the self-label assignment estimator. Comprehensive empirical analyses demonstrate that our method yields substantial performance gains on both known and unknown classes compared with previous studies. Code is available at https://github.com/niusj03/OwMatch.
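The abstract names thresholding over known and unseen classes but gives no details, so the following is only a minimal sketch of the general idea of confidence-thresholded pseudo-labeling with a separate threshold per class group. It is not the OwMatch implementation. The function name `select_pseudo_labels`, the parameters `tau_known` and `tau_novel`, and their default values are all hypothetical and chosen for illustration.

```python
from typing import List, Optional, Sequence


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    num_known: int,
    tau_known: float = 0.95,
    tau_novel: float = 0.70,
) -> List[Optional[int]]:
    """Return a pseudo-label per sample, or None when confidence is too low.

    Classes 0..num_known-1 are treated as known; the rest as novel.
    Both thresholds are illustrative placeholders, not values from the paper.
    """
    labels: List[Optional[int]] = []
    for p in probs:
        # Predicted class is the argmax of the class probabilities.
        pred = max(range(len(p)), key=lambda k: p[k])
        # Pick the threshold of the group the predicted class belongs to.
        tau = tau_known if pred < num_known else tau_novel
        labels.append(pred if p[pred] >= tau else None)
    return labels


probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction of a known class
    [0.80, 0.10, 0.05, 0.05],  # known class, below tau_known
    [0.10, 0.05, 0.75, 0.10],  # novel class, above tau_novel
]
print(select_pseudo_labels(probs, num_known=2))  # [0, None, 2]
```

In a consistency-based pipeline, the retained pseudo-labels would typically supervise predictions on a differently augmented view of the same samples, while samples marked `None` are left out of the unlabeled loss for that step.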