Learning from positive and unlabeled data (PU learning) is an actively researched machine learning task. The goal is to train a binary classification model on a training dataset that contains a labeled subset of the positives together with unlabeled instances. The unlabeled set includes the remaining positives and all negative observations. An important element of PU learning is modeling the labeling mechanism, i.e. the assignment of labels to positive observations. Unlike many prior works, we consider a realistic setting in which the probability of label assignment, i.e. the propensity score, is instance-dependent. In our approach we investigate the minimizer of an empirical counterpart of a joint risk that depends on both the posterior probability of belonging to the positive class and the propensity score. The non-convex empirical risk is optimised alternately with respect to the parameters of the two functions. In the theoretical analysis we establish risk consistency of the minimisers using recently derived methods from the theory of empirical processes. A further important contribution is a novel implementation of the optimisation algorithm, in which sequential approximation of the set of positive observations among the unlabeled ones is crucial. This relies on a modified 'spies' technique as well as on a thresholding rule based on conditional probabilities. Experiments conducted on 20 data sets for various labeling scenarios show that the proposed method performs on par with or better than state-of-the-art methods based on propensity function estimation.
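
To make the joint risk concrete, here is one way it could be written down. This is a sketch under assumptions rather than the exact formulation of the paper: the symbols y, e, s, θ and γ, the logistic parametrisation, and the choice of the logistic (negative log-likelihood) loss are all introduced here for illustration. The only ingredient that follows directly from the setting described above is that labels are assigned to positives only, so the probability of observing a label factorises into posterior times propensity.

```latex
% Notation (assumed for illustration):
%   Y in {0,1}  true class,  S in {0,1}  label indicator (S = 1 means labeled),
%   y(x) = P(Y = 1 | X = x)          posterior probability of the positive class,
%   e(x) = P(S = 1 | Y = 1, X = x)   instance-dependent propensity score.

% Only positives can be labeled, i.e. P(S = 1 | Y = 0, X = x) = 0, hence
\[
  P(S = 1 \mid X = x) \;=\; y(x)\, e(x).
\]

% With parametric models y_\theta and e_\gamma (e.g. logistic, an assumption),
% an empirical joint risk based on the observed pairs (x_i, s_i), i = 1, ..., n, is
\[
  \widehat{R}(\theta, \gamma)
  \;=\; -\frac{1}{n} \sum_{i=1}^{n}
  \Bigl[\, s_i \log\bigl( y_\theta(x_i)\, e_\gamma(x_i) \bigr)
        + (1 - s_i) \log\bigl( 1 - y_\theta(x_i)\, e_\gamma(x_i) \bigr) \Bigr].
\]

% Alternating optimisation, starting from some \gamma^{(0)}:
\[
  \theta^{(t+1)} \in \arg\min_{\theta}\, \widehat{R}\bigl(\theta, \gamma^{(t)}\bigr),
  \qquad
  \gamma^{(t+1)} \in \arg\min_{\gamma}\, \widehat{R}\bigl(\theta^{(t+1)}, \gamma\bigr).
\]
```

The product y_θ(x) e_γ(x) inside the logarithms is what couples the two parameter vectors, which is why each step of the alternating scheme updates one set of parameters while the other is held fixed.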