Self-supervised pretraining on unlabeled data followed by supervised fine-tuning on labeled data is a popular paradigm for learning from limited labeled examples. We extend this paradigm to the classical positive-unlabeled (PU) setting, where the task is to learn a binary classifier given only a few labeled positive samples and (often) a large number of unlabeled samples, each of which may be positive or negative. We first propose a simple extension of the standard InfoNCE family of contrastive losses to the PU setting and show that it learns superior representations compared to existing unsupervised and supervised approaches. We then develop a simple methodology to pseudo-label the unlabeled samples using a new PU-specific clustering scheme; these pseudo-labels can then be used to train the final (positive vs. negative) classifier. Our method handily outperforms state-of-the-art PU methods on several standard PU benchmark datasets, without requiring a priori knowledge of the class prior (a common assumption in other PU methods). We also provide a simple theoretical analysis that motivates our methods.
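
To make the idea of a PU extension of InfoNCE concrete, the following is a minimal NumPy sketch of one plausible form such a loss could take. It is an assumption for illustration, not the exact loss defined in the paper: every sample is attracted to its own second augmented view (as in standard self-supervised InfoNCE), while labeled positives are additionally attracted to all other labeled positives. The function name `pu_contrastive_loss`, its arguments, and the default temperature are hypothetical.

```python
import numpy as np


def pu_contrastive_loss(z1, z2, is_labeled_pos, temperature=0.5):
    """Illustrative InfoNCE-style loss for a PU batch (assumed form).

    z1, z2: (n, d) embeddings of two augmented views of the same n samples.
    is_labeled_pos: (n,) booleans, True for labeled positives,
        False for unlabeled samples.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    labeled = np.concatenate([is_labeled_pos, is_labeled_pos]).astype(bool)

    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # a sample is never its own candidate

    # Row-wise log-softmax over all other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Attraction set: every sample pulls toward its other augmented view ...
    idx = np.arange(2 * n)
    attract = np.zeros((2 * n, 2 * n), dtype=bool)
    attract[idx, (idx + n) % (2 * n)] = True
    # ... and labeled positives also pull toward all other labeled positives.
    attract |= np.outer(labeled, labeled)
    np.fill_diagonal(attract, False)

    # Average negative log-probability over each anchor's attraction set.
    per_anchor = -np.where(attract, log_prob, 0.0).sum(axis=1) / attract.sum(axis=1)
    return per_anchor.mean()


# Usage on random embeddings: 3 labeled positives, 5 unlabeled samples.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 4))
z2 = rng.normal(size=(8, 4))
labeled = np.array([True] * 3 + [False] * 5)
loss = pu_contrastive_loss(z1, z2, labeled)
```

Under this assumed form, unlabeled samples are never treated as negatives of a particular class; they only receive the self-supervised signal, so no class prior enters the loss.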
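
The abstract does not spell out the PU-specific clustering scheme, so the sketch below shows only one way such pseudo-labeling could work, under stated assumptions: a two-cluster k-means on the learned embeddings in which the positive centroid is seeded with the mean of the labeled positives, the negative centroid is seeded with the unlabeled point farthest from it, and labeled positives stay pinned to the positive cluster. The name `pu_pseudo_labels`, the seeding rule, and the pinning rule are all hypothetical choices, not taken from the paper.

```python
import numpy as np


def pu_pseudo_labels(emb, is_labeled_pos, n_iter=50):
    """Assign pseudo-labels (1 = positive, 0 = negative) by seeded 2-means.

    emb: (n, d) embeddings; is_labeled_pos: (n,) booleans.
    Assumes at least one labeled positive and one unlabeled sample.
    """
    emb = np.asarray(emb, dtype=float)
    labeled = np.asarray(is_labeled_pos, dtype=bool)

    # Seed the positive centroid with the mean of the labeled positives.
    c_pos = emb[labeled].mean(axis=0)
    # Seed the negative centroid with the farthest unlabeled point (assumption).
    unlabeled = emb[~labeled]
    c_neg = unlabeled[np.argmax(np.linalg.norm(unlabeled - c_pos, axis=1))]

    assign = None
    for _ in range(n_iter):
        d_pos = np.linalg.norm(emb - c_pos, axis=1)
        d_neg = np.linalg.norm(emb - c_neg, axis=1)
        new_assign = d_pos <= d_neg
        new_assign[labeled] = True  # labeled positives stay in the positive cluster
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments stopped changing
        assign = new_assign
        c_pos = emb[assign].mean(axis=0)
        if (~assign).any():
            c_neg = emb[~assign].mean(axis=0)
    return assign.astype(int)


# Usage on two well-separated synthetic clusters, with 5 labeled positives.
rng = np.random.default_rng(0)
pos = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
neg = rng.normal(loc=-5.0, scale=0.5, size=(50, 2))
emb = np.vstack([pos, neg])
labeled = np.zeros(100, dtype=bool)
labeled[:5] = True
pseudo = pu_pseudo_labels(emb, labeled)
```

The resulting pseudo-labels could then be fed to any standard binary classifier. Note that this procedure uses only the labeled positives and the geometry of the embeddings, so it does not need an estimate of the class prior.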