Recent advances in semi-supervised learning have focused on a more realistic yet challenging task: handling class imbalance in the labeled data when the class distribution of the unlabeled data is unknown and possibly mismatched. Existing approaches often impose rigid assumptions on the class distribution of the unlabeled data, which restricts their adaptability to a limited range of distributions. In this study, we propose SimPro, a highly adaptable framework that does not rely on any predefined assumptions about the distribution of unlabeled data. Grounded in a probabilistic model, our framework refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of the conditional and marginal class distributions. This separation yields a closed-form solution for estimating the class distribution in the maximization phase, from which a Bayes classifier is derived. The Bayes classifier, in turn, improves the quality of pseudo-labels in the expectation phase. The SimPro framework comes with theoretical guarantees and is also straightforward to implement. Moreover, we introduce two novel class distributions that broaden the scope of the evaluation. Our method shows consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.
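The interplay between the two EM phases can be illustrated with a minimal sketch. It is not the authors' implementation (the repository linked above is the reference for that) and it rests on simplifying assumptions: the conditional model is fixed and supplied as class-wise log-likelihood scores, whereas SimPro also updates the network in the maximization phase. The function names `estimate_class_prior` and `make_toy_data`, the Gaussian toy data, and all parameter values are hypothetical. The expectation phase forms soft pseudo-labels with a Bayes rule that combines the scores with the current class prior; the maximization phase updates the prior in closed form as the mean of those pseudo-labels.

```python
import numpy as np


def estimate_class_prior(logits, n_iter=200, tol=1e-10):
    """EM estimate of the marginal class distribution of unlabeled data.

    Assumption: logits[i, k] is proportional to log p(x_i | y = k), i.e. the
    conditional model is fixed and carries no class prior of its own.
    """
    n, k = logits.shape
    prior = np.full(k, 1.0 / k)  # start from a uniform class distribution
    post = np.full((n, k), 1.0 / k)
    for _ in range(n_iter):
        # Expectation phase: Bayes posterior p(y | x) ∝ p(x | y) * prior(y),
        # computed in log space for numerical stability (soft pseudo-labels).
        joint = logits + np.log(np.clip(prior, 1e-12, None))
        joint -= joint.max(axis=1, keepdims=True)
        post = np.exp(joint)
        post /= post.sum(axis=1, keepdims=True)
        # Maximization phase: closed-form update of the class distribution,
        # the average of the soft pseudo-labels over all unlabeled samples.
        new_prior = post.mean(axis=0)
        converged = np.abs(new_prior - prior).max() < tol
        prior = new_prior
        if converged:
            break
    return prior, post


def make_toy_data(seed=0, n=5000):
    """Hypothetical imbalanced toy data: three 1-D unit-variance Gaussians."""
    rng = np.random.default_rng(seed)
    true_prior = np.array([0.7, 0.2, 0.1])
    means = np.array([-4.0, 0.0, 4.0])
    y = rng.choice(3, size=n, p=true_prior)
    x = rng.normal(means[y], 1.0)
    # Gaussian log-likelihood per class, up to a constant shared by all classes.
    logits = -0.5 * (x[:, None] - means[None, :]) ** 2
    return logits, y, true_prior


if __name__ == "__main__":
    logits, y, true_prior = make_toy_data()
    prior, post = estimate_class_prior(logits)
    print("true class distribution     :", true_prior)
    print("estimated class distribution:", np.round(prior, 3))
    print("pseudo-label accuracy       :", (post.argmax(axis=1) == y).mean())
```

Because the prior enters only through the additive term in the expectation phase, no assumption about the shape of the unlabeled class distribution is needed in this sketch; the estimate starts uniform and is driven entirely by the data.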