Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends

Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges such as accumulated errors and increased estimation bias persist because negative labels are absent. In this paper, we report an intriguing yet long-overlooked observation in PUL: \textit{resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns.} Specifically, the scores (output probabilities) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to $11.3\%$ in key metrics. The code is available at \href{https://github.com/wxr99/HolisticPU}{https://github.com/wxr99/HolisticPU}.
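
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each unlabeled example's score has been recorded once per training iteration. Balanced resampling is illustrated by drawing positives with replacement until they match the unlabeled batch size. Trend detection uses the classical Mann-Kendall statistic as a stand-in for the paper's TPP-inspired measure, which the abstract does not specify. All function names and the fixed `threshold` are hypothetical; the abstract states that the actual method needs no such extra tuning, so the threshold is only an illustrative simplification.

```python
import random
from typing import List, Sequence


def resample_positives(positives: Sequence, n_unlabeled: int,
                       rng: random.Random) -> List:
    """Draw positives with replacement so that the positive part of the
    batch is as large as the unlabeled part (hypothetical helper)."""
    return [rng.choice(positives) for _ in range(n_unlabeled)]


def mann_kendall_s(scores: Sequence[float]) -> int:
    """Mann-Kendall S statistic: number of increasing pairs minus number
    of decreasing pairs over all (earlier, later) pairs of scores.
    Stand-in for the paper's measure, which is not given in the abstract."""
    s = 0
    for i in range(len(scores) - 1):
        for j in range(i + 1, len(scores)):
            diff = scores[j] - scores[i]
            s += (diff > 0) - (diff < 0)
    return s


def normalized_trend(scores: Sequence[float]) -> float:
    """Scale S to [-1, 1]; -1 means every later score is lower."""
    n = len(scores)
    if n < 2:
        return 0.0
    return mann_kendall_s(scores) / (n * (n - 1) / 2)


def flag_likely_negatives(score_history: Sequence[Sequence[float]],
                          threshold: float = -0.8) -> List[bool]:
    """Flag examples whose scores fall consistently across iterations.
    `threshold` is an assumed knob for illustration only."""
    return [normalized_trend(scores) <= threshold for scores in score_history]


# Usage with made-up score histories (one row per unlabeled example).
history = [
    [0.62, 0.55, 0.41, 0.30, 0.18],  # steadily decreasing
    [0.58, 0.71, 0.49, 0.66, 0.60],  # no consistent direction
]
print([normalized_trend(scores) for scores in history])
print(flag_likely_negatives(history))

batch = resample_positives(["p1", "p2", "p3"], n_unlabeled=8,
                           rng=random.Random(0))
print(len(batch))
```

In this toy input the first history is flagged because every later score is lower than every earlier one, which mirrors the consistently decreasing pattern the abstract attributes to unlabeled negatives. The second history has as many rising as falling pairs, so it is left unflagged.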
