We introduce a new observational setting for Positive Unlabeled (PU) data in which observations at prediction time are also labeled. This setting occurs commonly in practice. We argue that the additional information is important for prediction and call the task "augmented PU prediction". We allow labeling to be feature-dependent. In this scenario, we establish the Bayes classifier and its risk, and compare that risk with the risk of a classifier that, for unlabeled data, is based only on predictors. We introduce several variants of the empirical Bayes rule for this scenario and investigate their performance. We emphasise the dangers (and ease) of applying a classical classification rule in the augmented PU scenario: because no prior studies exist, an unaware researcher is prone to skewing the obtained predictions. We conclude that the variant based on a recently proposed variational autoencoder designed for the PU scenario performs on par with or better than the other considered variants and yields an advantage over feature-only methods in terms of accuracy on unlabeled samples.
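To illustrate why the labeling indicator matters at prediction time, the following minimal sketch contrasts a feature-only rule with a rule that also conditions on the indicator. It is not the paper's implementation. The notation is an assumption introduced here: `p` stands for P(Y=1 | X=x), `e` for the feature-dependent labeling propensity P(S=1 | Y=1, X=x), and `s` for the observed labeling indicator. The sketch relies on the standard PU assumption that only positives can be labeled, so Bayes' theorem gives P(Y=1 | X=x, S=0) = (1 - e) p / (1 - e p). In practice `p` and `e` would be estimated; here they are supplied directly.

```python
import numpy as np

def posterior_given_unlabeled(p, e):
    """P(Y=1 | X=x, S=0), given p = P(Y=1 | X=x) and e = P(S=1 | Y=1, X=x).

    Assumes only positives can be labeled (S=1 implies Y=1).
    Undefined when e * p == 1, since S=0 then has probability zero.
    """
    p = np.asarray(p, dtype=float)
    e = np.asarray(e, dtype=float)
    return (1.0 - e) * p / (1.0 - e * p)

def augmented_pu_predict(p, e, s):
    """Rule that uses the labeling indicator s at prediction time."""
    s = np.asarray(s)
    post = posterior_given_unlabeled(p, e)
    # Labeled observations are positive by assumption; unlabeled ones are
    # classified by thresholding the posterior conditioned on S=0.
    return np.where(s == 1, 1, (post > 0.5).astype(int))

def feature_only_predict(p):
    """Rule that ignores s and thresholds P(Y=1 | X=x)."""
    return (np.asarray(p, dtype=float) > 0.5).astype(int)
```

For example, with `p = 0.6` and `e = 0.5`, an unlabeled observation has a posterior of 0.3 / 0.7, which is below 0.5, so the indicator-aware rule predicts the negative class while the feature-only rule predicts the positive class.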