Recent progress in semi- and self-supervised learning has challenged the long-held belief that machine learning needs an enormous amount of labeled data and that unlabeled data is irrelevant. Although these methods have succeeded on various types of data, no dominant semi- or self-supervised learning method generalizes to tabular data (i.e., most existing methods require appropriate tabular datasets and architectures). In this paper, we revisit self-training, which can be applied to any kind of algorithm, including the most widely used architecture, the gradient boosting decision tree, and introduce curriculum pseudo-labeling (a state-of-the-art pseudo-labeling technique in the image domain) to the tabular domain. Furthermore, existing pseudo-labeling techniques do not ensure that the cluster assumption holds when computing confidence scores of pseudo-labels generated from unlabeled data. To overcome this issue, we propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels, so that more reliable pseudo-labels, which lie in high-density regions, can be obtained. We extensively validate the superiority of our approaches using various models and tabular datasets.
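The idea of regularizing pseudo-label confidence by likelihood can be illustrated with a minimal sketch. This is not the authors' implementation: the base learner is a toy nearest-centroid classifier standing in for any model with probability outputs (such as a gradient boosting decision tree), the likelihood is a per-class diagonal Gaussian, the product of confidence and density ratio is an assumed form of the regularizer, and the threshold is fixed rather than curriculum-based. All function names are hypothetical.

```python
import math


def fit_centroids(X, y):
    """Toy base learner: per-class mean and per-feature variance."""
    stats = {}
    for label in sorted(set(y)):
        pts = [x for x, t in zip(X, y) if t == label]
        dim = len(pts[0])
        mean = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
        # Floor the variance so a single-point class does not divide by zero.
        var = [max(sum((p[d] - mean[d]) ** 2 for p in pts) / len(pts), 1e-6)
               for d in range(dim)]
        stats[label] = (mean, var)
    return stats


def predict_proba(stats, x):
    """Softmax over negative squared distances to each class centroid."""
    labels = sorted(stats)
    scores = [-sum((xi - m) ** 2 for xi, m in zip(x, stats[l][0]))
              for l in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}


def regularized_confidence(stats, x):
    """Return (pseudo-label, raw confidence, likelihood-regularized score).

    Assumption: the regularizer multiplies the raw confidence by the
    Gaussian density of x under the predicted class, divided by that
    class's peak density, so the factor lies in [0, 1].
    """
    proba = predict_proba(stats, x)
    label = max(proba, key=proba.get)
    mean, var = stats[label]
    density_ratio = math.exp(
        -0.5 * sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))
    return label, proba[label], proba[label] * density_ratio


def self_train(X_lab, y_lab, X_unl, threshold=0.5, rounds=5):
    """Self-training loop: add pseudo-labeled points whose regularized
    score reaches the (fixed, assumed) threshold, then refit."""
    X_train, y_train = list(X_lab), list(y_lab)
    pool = list(X_unl)
    for _ in range(rounds):
        stats = fit_centroids(X_train, y_train)
        accepted, rejected = [], []
        for x in pool:
            label, _, score = regularized_confidence(stats, x)
            if score >= threshold:
                accepted.append((x, label))
            else:
                rejected.append(x)
        if not accepted:
            break
        for x, label in accepted:
            X_train.append(x)
            y_train.append(label)
        pool = rejected
    return fit_centroids(X_train, y_train), X_train, y_train, pool


# Two tight clusters plus one unlabeled point far from both.
X_lab = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
y_lab = [0, 0, 1, 1]
X_unl = [[0.4, 0.6], [10.6, 10.4], [0.5, -30.0]]

model, X_train, y_train, pool = self_train(X_lab, y_lab, X_unl)
print("pseudo-labeled points added:", len(X_train) - len(X_lab))
print("still unlabeled:", pool)
```

In this toy setup the far-away point receives a raw confidence close to 1, because it is much nearer to one centroid than the other, yet its regularized score collapses toward 0 because it lies in a low-density region. That gap is the behavior the cluster assumption argues against accepting as a pseudo-label.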