We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. Distribution-free upper and lower bounds on the gen-error can also be obtained. Our findings show that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information {\em shared} between the {\em labeled} and {\em pseudo-labeled} data samples. This serves as a guideline for choosing an appropriate pseudo-labeling method from a given family of methods. To deepen our understanding, we further explore two examples: mean estimation and logistic regression. In particular, we analyze how the ratio $\lambda$ of the number of unlabeled to labeled data samples affects the gen-error in both scenarios. As $\lambda$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than that attained when all the samples are labeled. Our analysis quantifies this gap {\em exactly} and shows that it depends on the {\em cross-covariance} between the labeled and pseudo-labeled data samples. For logistic regression, the gen-error and the variance component of the excess risk also decrease as $\lambda$ increases.
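For orientation, the two standard objects named in the abstract can be written as follows. This is a sketch in assumed notation, not the paper's own: $\pi$ (prior over hypotheses), $\gamma > 0$ (inverse temperature), $L_E$ (empirical risk), and $S$ (training data) are symbols introduced here for illustration. A Gibbs algorithm draws the output hypothesis $W$ from the posterior
\[
P_{W|S}(w \mid s) = \frac{\pi(w)\, e^{-\gamma L_E(w,s)}}{\int \pi(w')\, e^{-\gamma L_E(w',s)}\, \mathrm{d}w'},
\]
and the symmetrized KL information between two random variables $X$ and $Y$ is the sum of the mutual information and the lautum information,
\[
I_{\mathrm{SKL}}(X;Y) = D\!\left(P_{X,Y} \,\|\, P_X \otimes P_Y\right) + D\!\left(P_X \otimes P_Y \,\|\, P_{X,Y}\right).
\]
The exact SSL expression, which involves the output hypothesis together with both the labeled and the pseudo-labeled datasets, is not stated in the abstract and is therefore not reproduced here.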