Positive-Unlabeled (PU) learning addresses binary classification problems in which abundant unlabeled data is available alongside only a small number of labeled positive instances, a setting that arises naturally in chronic disease screening. State-of-the-art PU learning methods have produced a variety of risk estimators, yet they neglect the differences among distinct populations. To address this issue, we present a novel Positive-Unlabeled Learning Tree (PUtree) algorithm. PUtree is designed to take into account communities, such as different age or income brackets, in chronic disease prediction tasks. We propose a novel approach to binary decision-making that hierarchically builds community-based PU models and then aggregates their outputs. Our method can explain each PU model on the tree, which supports optimized splitting of non-leaf PU nodes. Furthermore, a mask-recovery data augmentation strategy enables sufficient training of the model within individual communities. In addition, the proposed approach includes an adversarial PU risk estimator that captures hierarchical PU relationships, and a model fusion network that integrates data from each tree path, yielding robust binary classification results. We demonstrate the superior performance of PUtree and its variants on two benchmarks and a new diabetes-prediction dataset.
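For readers unfamiliar with the risk estimators mentioned above, the following is a minimal sketch of one widely used estimator from the PU literature, the non-negative PU risk. It is not PUtree's adversarial estimator, which the abstract does not specify. The function names, the choice of sigmoid surrogate loss, and the assumption that the class prior is known in advance are all illustrative.

```python
import numpy as np


def sigmoid_loss(scores, label):
    """Sigmoid surrogate loss l(z, y) = 1 / (1 + exp(y * z))."""
    return 1.0 / (1.0 + np.exp(label * scores))


def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk of a scoring function g (illustrative sketch).

    scores_pos: g(x) evaluated on labeled positive examples
    scores_unl: g(x) evaluated on unlabeled examples
    prior:      assumed class prior P(y = +1); in practice it must be
                known or estimated separately
    """
    # Risk of misclassifying positives, weighted by the class prior.
    risk_pos = prior * sigmoid_loss(scores_pos, +1).mean()
    # Negative-class risk: unlabeled data treated as negative, minus the
    # contribution of the positives hidden inside the unlabeled set.
    risk_neg = (sigmoid_loss(scores_unl, -1).mean()
                - prior * sigmoid_loss(scores_pos, -1).mean())
    # Clamp at zero so the empirical estimate cannot become negative.
    return risk_pos + max(0.0, risk_neg)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(loc=1.0, size=50)    # hypothetical scores on positives
    unl = rng.normal(loc=0.0, size=500)   # hypothetical scores on unlabeled data
    print(nn_pu_risk(pos, unl, prior=0.3))
```

In a community-based scheme such as the one described, a risk of this kind could in principle be evaluated separately for each subgroup's model, although how PUtree actually couples the models along a tree path is not stated here.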