Semi-supervised learning by self-training relies heavily on pseudo-label selection (PLS). The selection often depends on the initial model fit on labeled data. Early overfitting can thus propagate to the final model through the selection of instances with overconfident but erroneous predictions, a phenomenon often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this selection criterion by proving Bayes optimality of the posterior predictive of pseudo-samples. We further overcome computational hurdles by approximating the criterion analytically: its relation to the marginal likelihood allows us to derive an approximation based on Laplace's method and the Gaussian integral. We empirically assess BPLS for parametric generalized linear and non-parametric generalized additive models on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.
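To make the idea of scoring pseudo-samples through a Laplace-approximated marginal likelihood concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a Bayesian logistic regression with an isotropic Gaussian prior, and the function names (`fit_map`, `laplace_log_evidence`, `select_pseudo_label`) and the `prior_var` parameter are hypothetical choices made for illustration. For a conditional model, the posterior predictive of a pseudo-sample equals the marginal likelihood of the augmented data divided by that of the labeled data, so ranking candidates by the augmented log evidence gives the same ordering.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function (identity: sigmoid(z) = (1 + tanh(z/2)) / 2).
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def log_joint(theta, X, y, prior_var):
    # log p(y | X, theta) + log p(theta) with an assumed N(0, prior_var * I) prior.
    z = X @ theta
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    d = theta.size
    log_prior = -0.5 * (theta @ theta) / prior_var - 0.5 * d * np.log(2 * np.pi * prior_var)
    return log_lik + log_prior


def neg_hessian(theta, X, prior_var):
    # Negative Hessian of the log joint (positive definite thanks to the prior).
    p = sigmoid(X @ theta)
    return X.T @ (X * (p * (1 - p))[:, None]) + np.eye(theta.size) / prior_var


def fit_map(X, y, prior_var, n_iter=50):
    # Newton's method for the posterior mode (MAP estimate).
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (y - sigmoid(X @ theta)) - theta / prior_var
        step = np.linalg.solve(neg_hessian(theta, X, prior_var), grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta, neg_hessian(theta, X, prior_var)


def laplace_log_evidence(X, y, prior_var=1.0):
    # Laplace's method: log p(D) ~ log_joint(mode) + (d/2) log(2 pi) - (1/2) log|H|.
    theta, H = fit_map(X, y, prior_var)
    _, logdet = np.linalg.slogdet(H)
    return log_joint(theta, X, y, prior_var) + 0.5 * theta.size * np.log(2 * np.pi) - 0.5 * logdet


def select_pseudo_label(X_lab, y_lab, X_unl, prior_var=1.0):
    # Pseudo-label each unlabeled point with the current MAP fit, then score the
    # augmented data set D + (x_i, y_hat_i) by its approximate log evidence.
    theta, _ = fit_map(X_lab, y_lab, prior_var)
    y_hat = (sigmoid(X_unl @ theta) >= 0.5).astype(float)
    scores = np.array([
        laplace_log_evidence(np.vstack([X_lab, x]), np.append(y_lab, yh), prior_var)
        for x, yh in zip(X_unl, y_hat)
    ])
    best = int(np.argmax(scores))
    return best, y_hat[best], scores
```

In a self-training loop, the selected instance and its pseudo-label would be moved into the labeled set and the procedure repeated. Refitting the mode once per candidate keeps the sketch short but is costly; a practical version would likely reuse the current fit.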