Self-training (ST) is a simple yet effective semi-supervised learning method. However, why and how ST improves generalization performance by using potentially erroneous pseudo-labels is still not well understood. To deepen the understanding of ST, we derive and analyze a sharp characterization of the behavior of iterative ST when training a linear classifier by minimizing a ridge-regularized convex loss on binary Gaussian mixtures, in the asymptotic limit where the input dimension and the data size diverge proportionally. The results show that ST improves generalization in different ways depending on the number of iterations. When the number of iterations is small, ST improves generalization performance by fitting the model to relatively reliable pseudo-labels and updating the model parameters by a large amount at each iteration. This indicates that, in this regime, ST works the way intuition suggests. With many iterations, on the other hand, ST can gradually improve the direction of the classification plane by updating the model parameters incrementally, using soft labels and small regularization. We argue that this is because the small updates of ST can extract information from the data in an almost noiseless way. However, in the presence of label imbalance, ST generalizes worse than supervised learning with true labels. To overcome this, we propose two heuristics that enable ST to achieve performance nearly comparable to that of supervised learning even under significant label imbalance.
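The iterative procedure described above can be illustrated with a minimal sketch. It is not the authors' exact setup: the squared loss stands in for the generic convex loss, `tanh` stands in for the soft-label function, a fresh unlabeled batch is drawn at every iteration, and all function names and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def sample_mixture(rng, n, mu):
    """Draw n points from a binary Gaussian mixture: x = y * mu + standard normal noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu[None, :] + rng.standard_normal((n, mu.size))
    return x, y

def ridge_fit(x, targets, lam):
    """Minimize the mean squared loss (an assumed convex loss) plus a ridge penalty."""
    n, d = x.shape
    return np.linalg.solve(x.T @ x / n + lam * np.eye(d), x.T @ targets / n)

def self_train(rng, mu, n_labeled, n_unlabeled, n_iters, lam, soft=True):
    """Fit on labeled data once, then refit repeatedly on pseudo-labeled batches."""
    x_l, y_l = sample_mixture(rng, n_labeled, mu)
    w = ridge_fit(x_l, y_l, lam)
    history = [w]
    for _ in range(n_iters):
        # Fresh unlabeled batch: the true labels are discarded.
        x_u, _ = sample_mixture(rng, n_unlabeled, mu)
        scores = x_u @ w
        # Soft pseudo-labels keep confidence information; hard ones keep only the sign.
        pseudo = np.tanh(scores) if soft else np.sign(scores)
        w = ridge_fit(x_u, pseudo, lam)
        history.append(w)
    return history

def holdout_accuracy(rng, w, mu, n_test=5000):
    """Fraction of fresh mixture samples whose label matches sign(x . w)."""
    x, y = sample_mixture(rng, n_test, mu)
    return float(np.mean(np.sign(x @ w) == y))
```

Calling `self_train` with a small and a large `n_iters`, and with `soft=True` versus `soft=False`, then passing each returned weight vector to `holdout_accuracy`, gives a rough way to compare the regimes discussed above. A finite-size simulation like this one is not expected to reproduce the asymptotic results.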