Semi-supervised learning (SSL) is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be biased by the presence of ``informative'' labels, which occur when some classes are more likely to be labeled than others. In the missing-data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing-data scenarios.
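The inverse propensity weighting idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the per-class labeling probabilities `phi` are already known (the paper estimates them), and the function names `ipw_class_proportions` and `ipw_labeled_loss` are hypothetical. Each labeled example of class k is weighted by 1/phi[k], which compensates for classes that are labeled more often than others.

```python
import numpy as np

def ipw_class_proportions(y_labeled, phi, n_total):
    """IPW estimate of class proportions using labeled examples only.

    phi[k] is the (assumed known) probability that an example of class k
    gets labeled; n_total counts labeled and unlabeled examples together.
    """
    counts = np.bincount(y_labeled, minlength=len(phi))
    return counts / (phi * n_total)

def ipw_labeled_loss(per_example_loss, y_labeled, phi, n_total):
    """Labeled loss term reweighted by the inverse labeling propensity."""
    weights = 1.0 / phi[y_labeled]
    return np.sum(weights * per_example_loss) / n_total

# Synthetic illustration (all numbers below are made up for the example).
rng = np.random.default_rng(0)
n = 200_000
true_props = np.array([0.5, 0.3, 0.2])   # true class proportions
phi = np.array([0.1, 0.3, 0.6])          # informative labels: class 2 is labeled most often

y = rng.choice(3, size=n, p=true_props)  # true classes, mostly unobserved
labeled = rng.random(n) < phi[y]         # class-dependent labeling mechanism
y_lab = y[labeled]

naive = np.bincount(y_lab, minlength=3) / y_lab.size  # ignores the mechanism
debiased = ipw_class_proportions(y_lab, phi, n)       # corrects for it

print("naive   :", naive)
print("debiased:", debiased)
```

In this sketch the naive proportions over-represent the frequently labeled class, while the weighted estimate stays close to `true_props`. The same weights applied in `ipw_labeled_loss` give a labeled loss term whose expectation matches the loss over the full dataset, under the assumption that labeling depends only on the class.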