In this paper we argue that the performance of classifiers based on Empirical Risk Minimization (ERM) for positive-unlabeled data that are designed for the case-control sampling scheme may deteriorate significantly when they are applied in a single-sample scenario. We show why their behavior depends on the scenario in all but very specific cases. We also introduce a single-sample analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with the original proposal. We show that significant differences occur between the two, especially when half or more of the positive observations are labeled. The opposite case, in which an ERM minimizer designed for the case-control scenario is applied to single-sample data, is also considered, and similar conclusions are drawn. Accounting for the difference between the scenarios requires a single, but crucial, change in the definition of the empirical risk.