Partial Label Learning (PLL) is a typical weakly supervised learning task in which each training instance is annotated with a set of candidate labels that contains the ground-truth label. Recent PLL methods adopt identification-based disambiguation to alleviate the influence of false positive labels and achieve promising performance. However, they require every class in the test set to have appeared in the training set, ignoring the fact that new classes keep emerging in real applications. To address this issue, in this paper we focus on the problem of Partial Label Learning with Augmented Class (PLLAC), where one or more augmented classes are not visible during training but appear at inference time. Specifically, we propose an unbiased risk estimator with theoretical guarantees for PLLAC, which estimates the distribution of the augmented classes by separating the known-class distribution out of the unlabeled data, and which can be equipped with arbitrary PLL loss functions. In addition, we provide a theoretical analysis of the estimation error bound of the estimator, which guarantees that the empirical risk minimizer converges to the true risk minimizer as the number of training data tends to infinity. Furthermore, we add a risk-penalty regularization term to the optimization objective to mitigate the over-fitting caused by negative empirical risk. Extensive experiments on benchmark, UCI, and real-world datasets demonstrate the effectiveness of the proposed approach.
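The risk-rewriting idea described above can be sketched in a few lines. The snippet below is a minimal illustrative sketch, not the paper's exact estimator: it assumes (as in PU-learning-style risk rewriting) that the augmented-class risk on unlabeled data is obtained by subtracting the known-class contribution weighted by an assumed mixture proportion `theta`, and that the risk-penalty term penalizes the corrected risk whenever it becomes negative. All function and parameter names here are hypothetical.

```python
import numpy as np

def pllac_risk(loss_pll, loss_unl_aug, loss_unl_known, theta, lam=1.0):
    """Illustrative PLLAC-style corrected risk (hypothetical names).

    loss_pll       : per-example PLL losses on the partially labeled data
    loss_unl_aug   : per-example augmented-class losses on unlabeled data
    loss_unl_known : per-example augmented-class losses attributed to the
                     known-class component of the unlabeled data
    theta          : assumed known-class mixture proportion in unlabeled data
    lam            : weight of the risk-penalty regularization term
    """
    # Risk on the partially labeled (known-class) data.
    r_known = np.mean(loss_pll)
    # Augmented-class risk estimated from unlabeled data by subtracting
    # the known-class component; this difference can go negative, which
    # is the source of the over-fitting issue mentioned in the abstract.
    r_aug = np.mean(loss_unl_aug) - theta * np.mean(loss_unl_known)
    # Risk-penalty regularization: penalize negative empirical risk
    # instead of letting the optimizer exploit it.
    penalty = lam * max(0.0, -r_aug)
    return r_known + r_aug + penalty
```

When the corrected term is negative (e.g. `r_aug = -0.2`), the penalty cancels it with `lam=1.0`, so the objective cannot be driven arbitrarily low by over-fitting the subtraction; when it is non-negative, the penalty vanishes and the estimator is unchanged.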