Recent progress in semi- and self-supervised learning has challenged the long-held belief that machine learning requires an enormous amount of labeled data and that unlabeled data is irrelevant. Although these approaches have been successful on various types of data, no dominant semi- or self-supervised learning method generalizes to tabular data (i.e., most existing methods require appropriate tabular datasets and architectures). In this paper, we revisit self-training, which can be applied to any kind of algorithm, including the most widely used architecture, the gradient boosting decision tree, and introduce curriculum pseudo-labeling (a state-of-the-art pseudo-labeling technique in the image domain) to the tabular domain. Furthermore, existing pseudo-labeling techniques do not ensure the cluster assumption when computing confidence scores of pseudo-labels generated from unlabeled data. To overcome this issue, we propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels, so that more reliable pseudo-labels, which lie in high-density regions, can be obtained. We extensively validate the superiority of our approaches using various models and tabular datasets.
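The idea of regularizing pseudo-label confidence by likelihood can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the per-class naive Gaussian density model, the multiplicative form `confidence * relative_likelihood`, the fixed `threshold`, and all function names are hypothetical. The classifier probabilities in `probs` are supplied by hand here; in practice they would come from a model such as a gradient boosting decision tree.

```python
import math
from statistics import mean, pstdev


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_class_densities(X, y):
    """Fit an independent Gaussian per feature for each class (assumed density model)."""
    params = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        # A small floor on sigma avoids division by zero for constant features.
        params[label] = [(mean(c), max(pstdev(c), 1e-6)) for c in cols]
    return params


def likelihood(x, class_params):
    """Product of per-feature densities of x under one class model."""
    p = 1.0
    for v, (mu, sigma) in zip(x, class_params):
        p *= gaussian_pdf(v, mu, sigma)
    return p


def select_pseudo_labels(X_unlabeled, probs, params, threshold=0.5):
    """Return (index, label, score) for samples whose regularized score passes."""
    scored = []
    for i, (x, p) in enumerate(zip(X_unlabeled, probs)):
        label = max(p, key=p.get)  # pseudo-label = most probable class
        scored.append((i, label, p[label], likelihood(x, params[label])))
    max_lik = max(s[3] for s in scored) or 1.0
    selected = []
    for i, label, conf, lik in scored:
        # Assumed regularizer: scale confidence by density relative to the batch maximum.
        score = conf * (lik / max_lik)
        if score >= threshold:
            selected.append((i, label, score))
    return selected


# Toy data: two tight labeled clusters and two unlabeled points.
X_labeled = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]]
y_labeled = [0, 0, 1, 1]
X_unlabeled = [[0.1, 0.0], [2.5, 2.5]]  # second point lies between the clusters
probs = [{0: 0.95, 1: 0.05}, {0: 0.02, 1: 0.98}]  # hand-written classifier outputs

params = fit_class_densities(X_labeled, y_labeled)
selected = select_pseudo_labels(X_unlabeled, probs, params)
print(selected)
```

In this toy setup the second unlabeled point has the higher raw confidence (0.98), yet it sits in a low-density region between the clusters, so its regularized score collapses and only the first point is kept. A plain confidence threshold would have accepted both.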