Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this setting arises in many applications, these algorithms have received considerable interest in both academia and industry. Among existing techniques, self-training methods have attracted particular attention in recent years. These models are designed to place the decision boundary in low-density regions without making additional assumptions about the data distribution, and they use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. Self-training algorithms learn a classifier iteratively: at each iteration, they assign pseudo-labels to the unlabeled training samples whose margin exceeds a certain threshold. The pseudo-labeled examples are then added to the labeled training set, and a new classifier is trained on the enlarged set. In this paper, we present self-training methods for binary and multi-class classification, as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
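The iterative pseudo-labeling loop described above can be illustrated with a minimal sketch. It is not taken from any specific method covered by the survey: the nearest-centroid base classifier, the function names (`fit_centroids`, `predict_with_margin`, `self_train`), the fixed threshold, and the definition of the margin as the gap between the two closest centroid distances are all assumptions made for illustration. Any classifier that produces a confidence score could take the place of the base learner.

```python
import numpy as np

def fit_centroids(X, y):
    """Hypothetical base learner: one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_with_margin(model, X):
    """Return predicted labels and a margin used as a confidence indicator.

    The margin here is assumed to be the gap between the distance to the
    second-closest centroid and the distance to the closest one.
    Requires at least two classes.
    """
    classes, centroids = model
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    sorted_dist = np.sort(dist, axis=1)
    margin = sorted_dist[:, 1] - sorted_dist[:, 0]
    labels = classes[np.argmin(dist, axis=1)]
    return labels, margin

def self_train(X_lab, y_lab, X_unl, threshold=1.0, max_iter=10):
    """Self-training with a fixed margin threshold (illustrative only)."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unl.copy()
    model = fit_centroids(X_train, y_train)
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        labels, margin = predict_with_margin(model, remaining)
        confident = margin > threshold
        if not confident.any():
            break  # no unlabeled sample clears the threshold; stop early
        # Enrich the training set with the confident pseudo-labeled samples.
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, labels[confident]])
        remaining = remaining[~confident]
        # Train a new classifier on labeled plus pseudo-labeled data.
        model = fit_centroids(X_train, y_train)
    return model, len(remaining)
```

The second return value counts the unlabeled samples that never cleared the threshold. A fixed threshold is only one possible choice, and it is the simplest to show.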