Recent work in semi-supervised learning has turned to a more realistic yet challenging setting: the labeled data are class-imbalanced, while the class distribution of the unlabeled data is unknown and may not match that of the labeled data. Existing approaches often make rigid assumptions about the class distribution of the unlabeled data, which restricts their adaptability to a limited range of distributions. In this study, we propose SimPro, a highly adaptable framework that does not rely on any predefined assumptions about the distribution of unlabeled data. Grounded in a probabilistic model, our framework refines the expectation-maximization (EM) algorithm by explicitly decoupling the modeling of the conditional and marginal class distributions. This separation yields a closed-form solution for class distribution estimation in the maximization phase and leads to the formulation of a Bayes classifier. The Bayes classifier, in turn, improves the quality of pseudo-labels in the expectation phase. SimPro comes with theoretical guarantees and is straightforward to implement. Moreover, we introduce two novel class distributions that broaden the scope of the evaluation. Our method achieves consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios. Our code is available at https://github.com/LeapLabTHU/SimPro.
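The alternation between the expectation and maximization phases described in the abstract can be illustrated with a minimal sketch. This is a generic EM estimate of a class prior and is not taken from the SimPro repository. It assumes that a classifier already produces class-balanced logits for the unlabeled samples (a stand-in for the conditional distribution), and that the marginal class distribution is a single categorical vector. The function names `e_step`, `m_step`, and `estimate_prior` are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def e_step(balanced_logits, prior):
    """Expectation phase (illustrative): Bayes-style posteriors used as soft pseudo-labels.

    Adding the log-prior to class-balanced logits combines the conditional
    and marginal terms before normalization.
    """
    log_prior = np.log(np.clip(prior, 1e-12, None))  # guard against log(0)
    return softmax(balanced_logits + log_prior, axis=1)


def m_step(posteriors):
    """Maximization phase (illustrative): closed-form categorical prior.

    For a categorical marginal, the EM update is the mean of the
    soft pseudo-labels over all samples.
    """
    return posteriors.mean(axis=0)


def estimate_prior(balanced_logits, num_iters=50):
    """Alternate the two phases, starting from a uniform prior (an assumption)."""
    num_classes = balanced_logits.shape[1]
    prior = np.full(num_classes, 1.0 / num_classes)
    posteriors = e_step(balanced_logits, prior)
    for _ in range(num_iters):
        prior = m_step(posteriors)
        posteriors = e_step(balanced_logits, prior)
    return prior, posteriors


if __name__ == "__main__":
    # Toy unlabeled set: 8 samples favor class 0, 2 samples favor class 1.
    logits = np.array([[4.0, 0.0]] * 8 + [[0.0, 4.0]] * 2)
    prior, pseudo_labels = estimate_prior(logits)
    print("estimated class prior:", prior)
    print("hard pseudo-labels:", pseudo_labels.argmax(axis=1))
```

In this toy setting the estimated prior shifts toward the majority class without any assumption about the unlabeled distribution being supplied in advance. In a full training loop the classifier itself would also be updated in each maximization phase, which this sketch omits.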