Self-training is a simple yet effective method within semi-supervised learning. The idea is to iteratively enlarge the training data by adding pseudo-labeled data. Its generalization performance depends heavily on which pseudo-labeled data are selected, a step known as pseudo-label selection (PLS). In this paper, we aim to make PLS more robust to the modeling assumptions involved. To this end, we propose selecting pseudo-labeled data that maximize a multi-objective utility function. This function is constructed to account for different sources of uncertainty, three of which we discuss in more detail: model selection, accumulation of errors, and covariate shift. In the absence of second-order information on such uncertainties, we furthermore consider the generic approach of the generalized Bayesian alpha-cut updating rule for credal sets. As a practical proof of concept, we apply three of our robust extensions to simulated and real-world data. Results suggest that robustness with respect to model choice in particular can lead to substantial accuracy gains.
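To make the selection step concrete, the following is a minimal sketch of self-training in which each pseudo-labeled candidate is scored by a utility aggregated over several candidate models. Everything specific here is an assumption for illustration and not taken from the paper: the two toy 1-D classifiers (`centroid_model`, `knn_model`), the use of predicted confidence as a stand-in for each model's utility, the weights, and the two aggregation rules (weighted sum and worst case over models).

```python
import math


def centroid_model(labeled):
    """Toy model 1 (assumed): P(y=1 | x) from distances to the class centroids."""
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    c0, c1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)

    def prob(x):
        # Closer to c1 than to c0 gives a probability above 0.5.
        return 1.0 / (1.0 + math.exp(abs(x - c1) - abs(x - c0)))

    return prob


def knn_model(labeled, k=3):
    """Toy model 2 (assumed): P(y=1 | x) from the k nearest labeled points."""

    def prob(x):
        nearest = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
        # Laplace smoothing keeps the estimate strictly inside (0, 1).
        return (sum(y for _, y in nearest) + 1) / (len(nearest) + 2)

    return prob


def self_train(labeled, unlabeled, weights=(0.5, 0.5), aggregate="sum"):
    """Add one pseudo-labeled point per iteration, chosen by aggregated utility."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        # Refit every candidate model on the current (pseudo-)labeled data.
        models = [centroid_model(labeled), knn_model(labeled)]
        best = None
        for x in unlabeled:
            probs = [m(x) for m in models]
            avg = sum(w * p for w, p in zip(weights, probs))
            y_hat = 1 if avg >= 0.5 else 0
            # Per-model utility: confidence in the pseudo-label (a stand-in).
            utils = [p if y_hat == 1 else 1.0 - p for p in probs]
            if aggregate == "min":
                score = min(utils)  # worst case over models
            else:
                score = sum(w * u for w, u in zip(weights, utils))
            if best is None or score > best[0]:
                best = (score, x, y_hat)
        _, x, y_hat = best
        labeled.append((x, y_hat))
        unlabeled.remove(x)
    return labeled


if __name__ == "__main__":
    seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
    pool = [0.5, 9.5, 2.0, 8.0]
    for x, y in self_train(seed, pool):
        print(x, y)
```

Scoring each candidate under every model, rather than under a single fitted model, is what makes the selection less dependent on one model choice. Passing `aggregate="min"` ranks candidates by their worst-case utility across the models instead of a weighted average.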