Semi-supervised learning (SSL) assumes that neighboring points belong to the same category (the neighbor assumption) and that points in different clusters belong to different categories (the cluster assumption). Existing methods usually rely on similarity measures to retrieve similar neighboring points while ignoring the cluster assumption, so they may not exploit unlabeled information sufficiently or effectively. This paper first provides a systematic investigation into the role of probability density in SSL and lays a theoretical foundation for the cluster assumption. To this end, we introduce a Probability-Density-Aware Measure (PM) to discern the similarity between neighboring points. To further improve label propagation, we also design a Probability-Density-Aware Measure Label Propagation (PMLP) algorithm that fully accounts for the cluster assumption during propagation. Finally, we prove that traditional pseudo-labeling can be viewed as a special case of PMLP, which provides a theoretical explanation for PMLP's superior performance. Extensive experiments demonstrate that PMLP performs strongly compared with other recent methods.
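To make the idea of a density-aware similarity concrete, the following is a minimal sketch and not the paper's PM or PMLP definition, which the abstract does not spell out. It assumes one plausible stand-in: a Gaussian similarity scaled by the lowest kernel density estimate found along the segment joining two points, so that pairs separated by a low-density gap receive a small weight. The weights are then fed to the classic normalized-graph label propagation iteration. The function names, the bandwidth, and the toy data are all hypothetical.

```python
import numpy as np

def kde_density(X, queries, bandwidth=0.5):
    """Unnormalized Gaussian kernel density estimate at `queries`, built from samples X."""
    d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)

def density_aware_affinity(X, bandwidth=0.5, n_steps=5):
    """Hypothetical measure (an assumption, not the paper's PM):
    Gaussian similarity times the minimum density on the segment between two points."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2 / (2 * bandwidth ** 2))
    min_dens = np.full((n, n), np.inf)
    for t in np.linspace(0.0, 1.0, n_steps):
        # Points at fraction t along every pairwise segment, shape (n, n, d).
        pts = (1 - t) * X[:, None, :] + t * X[None, :, :]
        dens = kde_density(X, pts.reshape(n * n, -1), bandwidth).reshape(n, n)
        min_dens = np.minimum(min_dens, dens)
    W = sim * min_dens
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def label_propagation(W, y, n_classes, alpha=0.9, n_iter=100):
    """Standard iteration F <- alpha * S @ F + (1 - alpha) * Y; y == -1 marks unlabeled points."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    S = W * inv_sqrt[:, None] * inv_sqrt[None, :]  # symmetric normalization
    Y = np.zeros((len(y), n_classes))
    labeled = y >= 0
    Y[labeled, y[labeled]] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)

# Toy data: two well-separated clusters with one labeled point each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, (20, 2)), rng.normal(2.0, 0.3, (20, 2))])
y = np.full(40, -1)
y[0], y[20] = 0, 1
pred = label_propagation(density_aware_affinity(X), y, n_classes=2)
```

In this toy setting the low-density gap between the two clusters drives cross-cluster weights toward zero, so each label spreads only within its own cluster, which is the behavior the cluster assumption asks for.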