Semi-supervised learning (SSL) has shown great promise in leveraging unlabeled data to improve model performance. While standard SSL assumes a uniform class distribution, we consider a more realistic and challenging setting called imbalanced SSL, where both the labeled and the unlabeled data have imbalanced class distributions. Existing methods address this setting, but their performance degrades under severe imbalance because they cannot reduce the class imbalance sufficiently or effectively. In this paper, we study a simple yet overlooked baseline, SimiS, which tackles data imbalance by supplementing the labeled data with pseudo-labels, adding samples to each class according to the difference between its frequency and that of the most frequent class. This simple baseline turns out to be highly effective at reducing class imbalance. It outperforms existing methods by a significant margin, e.g., 12.8%, 13.6%, and 16.7% over the previous SOTA on CIFAR100-LT, FOOD101-LT, and ImageNet127, respectively. The reduced imbalance leads to faster convergence and more accurate pseudo-labels for SimiS. Because the method is simple, it can also be combined with other re-balancing techniques to improve performance further. Moreover, our method is robust to a wide range of data distributions, which gives it considerable practical potential. Code will be publicly available.
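The supplementing step described above can be illustrated with a minimal sketch. It is not the authors' released code: the function names `supplement_counts` and `select_pseudo_labels` are hypothetical, and two details the abstract does not specify are assumed here, namely that each class is topped up by exactly the gap between its labeled count and that of the most frequent class, and that candidates are taken in order of decreasing prediction confidence.

```python
import numpy as np


def supplement_counts(labeled_targets, num_classes):
    """Per-class number of pseudo-labeled samples to add.

    Assumption: the budget for a class is its gap to the most frequent class.
    """
    counts = np.bincount(labeled_targets, minlength=num_classes)
    return counts.max() - counts


def select_pseudo_labels(probs, budget):
    """Pick unlabeled samples to add to the labeled set.

    probs:  (num_unlabeled, num_classes) predicted class probabilities.
    budget: per-class number of samples to add.
    Assumption: within a class, the most confident predictions are taken first.
    Returns the chosen unlabeled indices and their pseudo-labels.
    """
    pseudo = probs.argmax(axis=1)   # hard pseudo-label per unlabeled sample
    conf = probs.max(axis=1)        # confidence of that pseudo-label
    chosen = []
    for c, k in enumerate(budget):
        if k == 0:
            continue
        candidates = np.flatnonzero(pseudo == c)
        # Sort this class's candidates by decreasing confidence.
        ranked = candidates[np.argsort(-conf[candidates], kind="stable")]
        chosen.append(ranked[:k])   # fewer than k are taken if too few exist
    idx = np.concatenate(chosen) if chosen else np.empty(0, dtype=int)
    return idx, pseudo[idx]
```

With labeled class counts of 4, 2, and 1, for instance, the budgets under this assumption are 0, 2, and 3, so the most frequent class receives no pseudo-labels and the rarest class receives the most.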