Prior work has shown that semi-supervised learning (SSL) algorithms can leverage unlabeled data to improve over the labeled sample complexity of supervised learning (SL) algorithms. However, existing theoretical analyses focus on regimes where the unlabeled data is sufficient to learn a good decision boundary using unsupervised learning (UL) alone. This raises the question: can SSL algorithms simultaneously improve upon both UL and SL? To this end, we derive a tight lower bound for 2-Gaussian mixture models that explicitly depends on the labeled and unlabeled dataset sizes as well as the signal-to-noise ratio of the mixture distribution. Surprisingly, our result implies that no SSL algorithm can improve upon the minimax-optimal statistical error rates of SL or UL algorithms for these distributions. Nevertheless, we show empirically on real-world data that SSL algorithms can still outperform UL and SL methods. Our work therefore suggests that, while proving performance gains for SSL algorithms is possible, it requires careful tracking of constants.