Positive-unlabeled (PU) learning is the task of learning a binary classifier from a small number of labeled positive samples and a set of unlabeled samples, each of which may be positive or negative. In this paper, we propose a novel PU learning framework that first learns a feature space through pretext-invariant representation learning and then assigns pseudo-labels to the unlabeled examples by exploiting the concentration property of the embeddings. Our approach outperforms state-of-the-art PU learning methods by a clear margin across several standard PU benchmark datasets, without requiring a priori knowledge or an estimate of the class prior. Notably, it remains effective even when labeled data is scarce, a regime in which most PU learning algorithms falter. We also provide a simple theoretical analysis that motivates the proposed algorithms and establish a generalization guarantee for our approach.
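To make the second stage concrete, the following is a minimal sketch of pseudo-labeling on top of embeddings that are assumed to be already computed. It is not the algorithm from the paper, whose details the abstract does not give. As an illustrative assumption, it uses a two-cluster k-means whose positive centroid is seeded with the mean of the labeled positives. The function name `pseudo_label`, its parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def pseudo_label(z_pos, z_unl, n_iter=20):
    """Assign pseudo-labels (1 = positive, 0 = negative) to unlabeled embeddings.

    Assumption: embeddings of the two classes form two well-concentrated
    clusters, so a two-centroid k-means is enough to separate them.
    """
    # Seed the positive centroid with the mean of the labeled positives.
    c_pos = z_pos.mean(axis=0)
    # Seed the negative centroid with the unlabeled point farthest from it.
    dist_to_pos = np.linalg.norm(z_unl - c_pos, axis=1)
    c_neg = z_unl[dist_to_pos.argmax()]

    for _ in range(n_iter):
        d_pos = np.linalg.norm(z_unl - c_pos, axis=1)
        d_neg = np.linalg.norm(z_unl - c_neg, axis=1)
        labels = (d_pos < d_neg).astype(int)
        # Labeled positives always contribute to the positive centroid.
        c_pos = np.vstack([z_pos, z_unl[labels == 1]]).mean(axis=0)
        # Update the negative centroid only if some point is assigned to it.
        if (labels == 0).any():
            c_neg = z_unl[labels == 0].mean(axis=0)
    return labels


# Synthetic stand-in for learned embeddings: two well-separated Gaussians.
rng = np.random.default_rng(0)
pos = rng.normal(loc=3.0, scale=0.5, size=(200, 8))
neg = rng.normal(loc=-3.0, scale=0.5, size=(200, 8))

z_pos = pos[:5]                      # only five labeled positives
z_unl = np.vstack([pos[5:], neg])    # unlabeled mix of both classes
true = np.array([1] * 195 + [0] * 200)

labels = pseudo_label(z_pos, z_unl)
accuracy = (labels == true).mean()
print(f"pseudo-label accuracy: {accuracy:.3f}")
```

Note that the sketch never uses the class prior: the split between positives and negatives comes only from the geometry of the embedding space, which is why the quality of the learned representation matters so much in this kind of pipeline.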