Learning from positive and unlabeled data is known in the literature as positive-unlabeled (PU) learning and has attracted much attention in recent years. One common approach in PU learning is to sample a set of pseudo-negatives from the unlabeled data using ad-hoc thresholds, so that conventional supervised methods can be applied to both positive and negative samples. Owing to the label uncertainty in the unlabeled data, unlabeled positive samples are inevitably misclassified as negative, and these errors may accumulate during training. They often lead to performance degradation and model instability. To mitigate the impact of label uncertainty and improve the robustness of learning from positive and unlabeled data, we propose a new robust PU learning method with a training strategy motivated by the nature of human learning: easy cases should be learned first. A similar intuition underlies curriculum learning, which uses only easier cases in the early stage of training before introducing more complex ones. Specifically, we use a novel ``hardness'' measure to distinguish unlabeled samples with a high chance of being negative from unlabeled samples with large label noise. An iterative training strategy then refines the selection of negative samples during training so that more ``easy'' samples are included in the early stage. Extensive experiments over a wide range of learning tasks show that this approach can effectively improve the accuracy and stability of learning from positive and unlabeled data. Our code is available at https://github.com/woriazzc/Robust-PU
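The easy-first selection loop described above can be illustrated with a minimal sketch. The abstract does not define the hardness measure, so the centroid-distance score below is a stand-in assumption, not the measure used in the paper; the function names (`centroid`, `hardness`, `select_easy_negatives`) and the linear schedule parameters (`rounds`, `start_frac`, `end_frac`) are likewise hypothetical and chosen only for illustration.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate lists."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def hardness(x, pos_center, neg_center):
    """Stand-in hardness score (an assumption, not the paper's measure).

    Higher values mean x looks more like a positive, so it is a riskier
    pseudo-negative. Before any negatives are selected, only the distance
    to the positive centroid is available.
    """
    d_pos = math.dist(x, pos_center)
    if neg_center is None:
        return -d_pos
    return math.dist(x, neg_center) - d_pos


def select_easy_negatives(positives, unlabeled, rounds=4,
                          start_frac=0.2, end_frac=0.5):
    """Iteratively re-select pseudo-negatives, easiest samples first.

    Each round ranks all unlabeled samples by hardness, keeps the easiest
    fraction (growing linearly from start_frac to end_frac), and refits
    the negative centroid on that selection.
    Returns the indices of the finally selected unlabeled samples.
    """
    pos_center = centroid(positives)
    neg_center = None
    chosen = []
    for r in range(rounds):
        frac = start_frac + (end_frac - start_frac) * r / max(rounds - 1, 1)
        k = max(1, round(frac * len(unlabeled)))
        ranked = sorted(
            range(len(unlabeled)),
            key=lambda i: hardness(unlabeled[i], pos_center, neg_center),
        )
        chosen = ranked[:k]
        neg_center = centroid([unlabeled[i] for i in chosen])
    return chosen
```

In a real pipeline the centroid model would be replaced by the classifier being trained, and the selected indices would supply the negative class for the next training round; the sketch only shows the shape of the selection schedule.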