Semi-supervised learning by self-training relies heavily on pseudo-label selection (PLS). The selection typically depends on the initial model fit to the labeled data. Early overfitting can therefore propagate to the final model when instances with overconfident but erroneous predictions are selected, a problem often referred to as confirmation bias. This paper introduces BPLS, a Bayesian framework for PLS that aims to mitigate this issue. At its core lies a criterion for selecting instances to label: an analytical approximation of the posterior predictive of pseudo-samples. We derive this criterion by proving that the posterior predictive of pseudo-samples is Bayes-optimal, and we overcome the computational hurdles of evaluating it by approximating it analytically. Its relation to the marginal likelihood allows us to derive an approximation based on Laplace's method and the Gaussian integral. We empirically assess BPLS for parametric generalized linear models and non-parametric generalized additive models on simulated and real-world data. When faced with high-dimensional data prone to overfitting, BPLS outperforms traditional PLS methods.
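To make the role of Laplace's method and the Gaussian integral more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a one-parameter logistic regression with a N(0, prior_var) prior, and it ranks pseudo-labeled candidates by the Laplace-approximated log marginal likelihood of the labeled data augmented with each candidate. That ranking is a simplified stand-in for the selection criterion described above, which may differ in its exact form. All names (`laplace_log_evidence`, `select_pseudo_label`, `prior_var`) are hypothetical.

```python
import math


def _softplus(z):
    # log(1 + exp(z)), written so that large |z| does not overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def _sigmoid(z):
    # sigmoid(z) = exp(log sigmoid(z)) = exp(-softplus(-z))
    return math.exp(-_softplus(-z))


def log_joint(theta, data, prior_var=1.0):
    """log p(data | theta) + log p(theta) for the assumed 1-D logistic model."""
    log_prior = (-0.5 * theta * theta / prior_var
                 - 0.5 * math.log(2.0 * math.pi * prior_var))
    log_lik = 0.0
    for x, y in data:
        z = theta * x
        # y == 1 contributes log sigmoid(z); y == 0 contributes log(1 - sigmoid(z)).
        log_lik -= _softplus(-z) if y == 1 else _softplus(z)
    return log_prior + log_lik


def grad_and_hess(theta, data, prior_var=1.0):
    """First and second derivative of log_joint with respect to theta."""
    grad = -theta / prior_var
    hess = -1.0 / prior_var
    for x, y in data:
        p = _sigmoid(theta * x)
        grad += (y - p) * x
        hess -= p * (1.0 - p) * x * x
    return grad, hess


def laplace_log_evidence(data, prior_var=1.0, max_iter=100, tol=1e-10):
    """Laplace approximation of log p(data).

    Expands log_joint to second order around its mode and integrates the
    resulting Gaussian in closed form:
        log p(data) ~ log_joint(mode) + 0.5*log(2*pi) - 0.5*log(-hess(mode))
    """
    theta = 0.0
    for _ in range(max_iter):  # Newton's method for the posterior mode
        grad, hess = grad_and_hess(theta, data, prior_var)
        if abs(grad) < tol:
            break
        theta -= grad / hess
    _, hess = grad_and_hess(theta, data, prior_var)
    return (log_joint(theta, data, prior_var)
            + 0.5 * math.log(2.0 * math.pi)
            - 0.5 * math.log(-hess))


def select_pseudo_label(labeled, candidates, prior_var=1.0):
    """Index of the candidate (x, pseudo_y) with the highest approximate evidence."""
    scores = [laplace_log_evidence(list(labeled) + [c], prior_var)
              for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Calling `select_pseudo_label(labeled, candidates)` with `labeled` and `candidates` given as lists of `(x, y)` pairs returns the position of the preferred candidate. Because the score integrates over the parameter instead of relying on a single fitted value, a confident prediction from an overfitted point estimate does not automatically dominate the ranking. With a multi-dimensional parameter, the scalar curvature would be replaced by the determinant of the negative Hessian.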