Self-supervised pretraining on unlabeled data followed by supervised fine-tuning on labeled data is a popular paradigm for learning from limited labeled examples. We extend this paradigm to the classical positive-unlabeled (PU) setting, where the task is to learn a binary classifier given only a few labeled positive samples and (often) a large number of unlabeled samples, each of which could be positive or negative. We first propose a simple extension of the standard InfoNCE family of contrastive losses to the PU setting, and show that it learns superior representations compared to existing unsupervised and supervised approaches. We then develop a simple methodology to pseudo-label the unlabeled samples using a new PU-specific clustering scheme; these pseudo-labels can then be used to train the final (positive vs. negative) classifier. Our method handily outperforms state-of-the-art PU methods on several standard PU benchmark datasets, while not requiring a priori knowledge of any class prior (a common assumption in other PU methods). We also provide a simple theoretical analysis that motivates our methods.
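The abstract does not spell out the form of the PU contrastive loss. As a minimal sketch of one plausible reading (an assumption for illustration, not the authors' stated formulation), an InfoNCE-style loss can treat the second augmented view of each sample as a positive, and additionally treat all labeled positives as positives of one another, while unlabeled samples receive no extra supervision. The function name `pu_contrastive_loss`, the `is_labeled_pos` flag array, and the temperature value are all hypothetical.

```python
import numpy as np

def pu_contrastive_loss(z1, z2, is_labeled_pos, temperature=0.5):
    """Illustrative PU variant of an InfoNCE-style loss (assumed form).

    z1, z2         : (n, d) embeddings of two augmented views of the same n samples
    is_labeled_pos : (n,) boolean flags, True for labeled positive samples
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    labeled = np.concatenate([is_labeled_pos, is_labeled_pos]).astype(bool)

    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # an anchor is never contrasted with itself

    # Numerically stable log-softmax over all other embeddings in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Positive pairs: the other view of the same sample (always) ...
    idx = np.arange(2 * n)
    pos_mask = np.zeros((2 * n, 2 * n), dtype=bool)
    pos_mask[idx, (idx + n) % (2 * n)] = True
    # ... plus every other labeled positive, if the anchor is a labeled positive.
    pos_mask |= np.outer(labeled, labeled)
    np.fill_diagonal(pos_mask, False)

    # Average the negative log-probability over each anchor's positive set.
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()

# Usage with random embeddings: 3 labeled positives, 5 unlabeled samples.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.01 * rng.normal(size=(8, 16))
flags = np.array([True] * 3 + [False] * 5)
loss = pu_contrastive_loss(z1, z2, flags)
```

Under this assumed form, setting every flag to False leaves only the augmentation pairs as positives, which is the ordinary self-supervised InfoNCE objective; the labeled positives are the only source of supervision, so no class prior enters the loss.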