Self-training has gained traction because of its simplicity and versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous confidence estimates. Several solutions have been proposed, but they require significant modifications to the self-training algorithm or the model architecture, and most have limited applicability to tabular domains. To address this issue, we explore a novel direction, reliable confidence in self-training contexts, and conclude that the confidence, which represents the value of a pseudo-label, should be aware of the cluster assumption. To this end, we propose Cluster-Aware Self-Training (CAST) for tabular data, which enhances existing self-training algorithms at negligible cost and without significant modifications. Concretely, CAST regularizes the classifier's confidence by leveraging the local density of each class in the labeled training data, forcing pseudo-labels in low-density regions to have lower confidence. Extensive empirical evaluations on up to 21 real-world datasets confirm not only the superior performance of CAST but also its robustness across various self-training setups.
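The idea of lowering confidence in low-density regions can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one Gaussian per class as the density estimator, a multiplicative regularization rule, and hypothetical names (`fit_class_gaussians`, `relative_density`, `regularize_confidence`, `ridge`) chosen only for illustration. Columns of the density matrix follow the sorted class labels, so the classifier's probability columns are assumed to use the same order.

```python
import numpy as np


def fit_class_gaussians(X, y, ridge=1e-3):
    """Fit one Gaussian (mean, inverse covariance) per class on labeled data.

    `ridge` is an assumed stabilizer added to the covariance diagonal so the
    matrix stays invertible; each class needs at least two labeled samples.
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + ridge * np.eye(X.shape[1])
        params[c] = (mean, np.linalg.inv(cov))
    return params


def relative_density(X, params):
    """Return an (n_samples, n_classes) matrix of relative class densities.

    Each entry is exp(-0.5 * squared Mahalanobis distance), which equals 1 at
    the class mean and decays toward 0 in low-density regions of that class.
    """
    columns = []
    for c in sorted(params):
        mean, inv_cov = params[c]
        diff = X - mean
        maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        columns.append(np.exp(-0.5 * maha))
    return np.stack(columns, axis=1)


def regularize_confidence(proba, density):
    """Scale classifier probabilities by the class-wise relative density."""
    return proba * density
```

In a self-training loop, the regularized values would stand in for the raw classifier probabilities when deciding which unlabeled samples pass the pseudo-labeling threshold, so a sample far from every labeled class cluster is less likely to be selected even if the classifier is confident about it.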