Semi-supervised learning (SSL) has been a fundamental challenge in machine learning for decades. The primary family of SSL algorithms, known as pseudo-labeling, assigns pseudo-labels to confident unlabeled instances and incorporates them into the training set. The criteria used to select confident instances are therefore crucial to the success of SSL. Recently, there has been growing interest in SSL methods that use dynamic or adaptive thresholds. Yet these methods typically apply the same threshold to all samples, or use class-dependent thresholds for instances belonging to a given class, and so neglect instance-level information. In this paper, we propose the study of instance-dependent thresholds, which offer the highest degree of freedom compared with existing methods. Specifically, we devise a novel instance-dependent threshold function for all unlabeled instances that uses their instance-level ambiguity and the instance-dependent error rates of their pseudo-labels, so that instances more likely to have incorrect pseudo-labels receive higher thresholds. Furthermore, we demonstrate that our instance-dependent threshold function provides a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.
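To make the three thresholding regimes named in the abstract concrete, the following is a minimal sketch of pseudo-label selection under a global, a class-dependent, and an instance-dependent threshold. It is illustrative only: the abstract does not give the form of the proposed threshold function, so `instance_threshold` below is a hypothetical stand-in that simply raises the threshold linearly with an assumed per-instance ambiguity score in [0, 1]. All function names, parameter values, and the toy data are assumptions, not taken from the paper.

```python
import numpy as np


def select_pseudo_labels(probs, thresholds):
    """Keep instances whose top predicted probability meets their threshold.

    probs:      (n, k) array of predicted class probabilities.
    thresholds: (n,) array holding one threshold per instance.
    Returns the indices of the kept instances and their pseudo-labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)      # pseudo-label = most likely class
    confidence = probs.max(axis=1)     # confidence = its probability
    keep = confidence >= thresholds
    return np.flatnonzero(keep), labels[keep]


def global_threshold(probs, tau=0.95):
    """Same threshold for every instance."""
    return np.full(len(probs), tau)


def class_threshold(probs, tau_per_class):
    """Threshold looked up from the predicted class of each instance."""
    labels = np.asarray(probs).argmax(axis=1)
    return np.asarray(tau_per_class, dtype=float)[labels]


def instance_threshold(ambiguity, base=0.7, max_tau=0.99):
    """HYPOTHETICAL instance-dependent threshold (not the paper's function):
    more ambiguous instances receive a higher threshold."""
    ambiguity = np.clip(np.asarray(ambiguity, dtype=float), 0.0, 1.0)
    return base + (max_tau - base) * ambiguity


# Toy predictions for four unlabeled instances over three classes (made up).
probs = np.array([
    [0.96, 0.03, 0.01],
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.40, 0.35, 0.25],
])
# Assumed per-instance ambiguity scores in [0, 1] (made up).
ambiguity = np.array([0.9, 0.1, 0.2, 0.8])

idx_global, _ = select_pseudo_labels(probs, global_threshold(probs))
idx_class, _ = select_pseudo_labels(probs, class_threshold(probs, [0.9, 0.8, 0.9]))
idx_inst, lab_inst = select_pseudo_labels(probs, instance_threshold(ambiguity))

print("global:  ", idx_global)
print("class:   ", idx_class)
print("instance:", idx_inst, lab_inst)
```

On this toy data the global threshold keeps only instance 0, the class-dependent thresholds keep instances 0 and 2, and the hypothetical instance-dependent threshold keeps instances 1 and 2 while rejecting instance 0, whose high confidence is outweighed by its high assumed ambiguity. That last case is the behavior the abstract describes: instances more likely to carry an incorrect pseudo-label face a higher bar.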