Analysis of High-dimensional Gaussian Labeled-unlabeled Mixture Model via Message-passing Algorithm

Semi-supervised learning (SSL) is a machine learning methodology that leverages unlabeled data together with a limited amount of labeled data. Although SSL has been used in many applications and its effectiveness has been demonstrated empirically, it is still not fully understood when and why it performs well. Some existing theoretical studies have addressed this question by modeling classification problems with the Gaussian mixture model (GMM). These studies provide insightful interpretations, but their analyses target specific purposes, and a thorough investigation of the properties of the GMM in the context of SSL has been lacking. In this paper, we conduct such an analysis of the high-dimensional GMM for binary classification in the SSL setting. To this end, we employ approximate message passing and state evolution, methods that originate from statistical mechanics and are widely used in high-dimensional settings. We consider two estimation approaches: Bayesian estimation and l2-regularized maximum likelihood estimation (RMLE). We compare the two approaches comprehensively, examining the global phase diagram, the estimation error for the parameters, and the prediction error for the labels. In particular, we compare RMLE with the Bayes-optimal (BO) estimator, since the BO setting provides optimal estimation performance and therefore serves as an ideal benchmark. Our analysis shows that, with appropriate regularization, RMLE can achieve near-optimal performance in both estimation error and prediction error, especially when a large amount of unlabeled data is available. These results demonstrate that the l2 regularization term plays an effective role in estimation and prediction in SSL.
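
As a rough illustration of the setting, the sketch below draws labeled and unlabeled samples from a two-cluster Gaussian mixture and fits the cluster mean by l2-regularized maximum likelihood using plain gradient descent. It is not the paper's method: the analysis there relies on approximate message passing and state evolution, and the abstract does not specify the exact model scaling. The parameterization x = y * mean + z with identity covariance and equal class priors, the function names, the step size, and the iteration count are all assumptions made for this example.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sample_data(mean, n_labeled, n_unlabeled, rng):
    """Draw x = y * mean + z, with y uniform on {-1, +1} and z ~ N(0, I) (assumed model)."""
    def draw():
        y = rng.choice((-1, 1))
        x = [y * m + rng.gauss(0.0, 1.0) for m in mean]
        return x, y
    labeled = [draw() for _ in range(n_labeled)]
    unlabeled = [draw()[0] for _ in range(n_unlabeled)]  # labels are discarded
    return labeled, unlabeled


def log_cosh(t):
    # Numerically stable log(cosh(t)).
    t = abs(t)
    return t + math.log1p(math.exp(-2.0 * t)) - math.log(2.0)


def objective(m, labeled, unlabeled, lam):
    """Averaged negative log-likelihood plus (lam / 2) * ||m||^2, additive constants dropped."""
    n = len(labeled) + len(unlabeled)
    total = 0.0
    for x, y in labeled:
        # Labeled sample: Gaussian centered at y * m.
        total += 0.5 * sum((xi - y * mi) ** 2 for xi, mi in zip(x, m))
    for x in unlabeled:
        # Unlabeled sample: equal-weight mixture of Gaussians at +m and -m.
        total += 0.5 * dot(x, x) + 0.5 * dot(m, m) - log_cosh(dot(x, m))
    return total / n + 0.5 * lam * dot(m, m)


def gradient(m, labeled, unlabeled, lam):
    n = len(labeled) + len(unlabeled)
    grad = [0.0] * len(m)
    for x, y in labeled:
        for j, xj in enumerate(x):
            grad[j] += m[j] - y * xj
    for x in unlabeled:
        t = math.tanh(dot(x, m))  # soft label inferred from the current estimate
        for j, xj in enumerate(x):
            grad[j] += m[j] - t * xj
    return [g / n + lam * mj for g, mj in zip(grad, m)]


def fit_rmle(labeled, unlabeled, lam, step=0.1, n_iter=200):
    """Gradient descent from zero; step and n_iter are illustrative choices."""
    m = [0.0] * len(labeled[0][0])
    for _ in range(n_iter):
        g = gradient(m, labeled, unlabeled, lam)
        m = [mj - step * gj for mj, gj in zip(m, g)]
    return m


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))
```

Calling `fit_rmle(labeled, unlabeled, lam)` on data from `sample_data` and comparing the result with the true mean via `cosine` gives a crude analogue of the estimation error discussed above; varying `lam` and the number of unlabeled samples shows how regularization and unlabeled data interact in this toy version.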
