In many classification problems, the costs of misclassifying observations from different classes can be highly unequal. The Neyman-Pearson multiclass classification (NPMC) framework addresses this issue by minimizing a weighted misclassification risk while imposing upper bounds on class-specific error probabilities. Existing NPMC methods typically assume that training labels are correctly observed. In practice, however, labels are often corrupted by measurement or annotation errors, and the effect of such label noise on NPMC procedures remains largely unexplored. We study the NPMC problem when only noisy labels are available in the training data. We propose an empirical likelihood (EL)-based method that relates the distributions of noisy and true labels through an exponential tilting density ratio model. The resulting maximum EL estimators recover the class proportions and posterior probabilities of the clean labels, which are required for error control. We establish consistency, asymptotic normality, and optimal convergence rates for these estimators. Under mild conditions, the resulting classifier asymptotically satisfies NP oracle inequalities with respect to the true labels. We compute the maximum EL estimators with an expectation-maximization (EM) algorithm. Simulations show that the proposed method performs comparably to the oracle classifier under clean labels and substantially improves over procedures that ignore label noise.
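For concreteness, the two ingredients named above can be written out as follows. This is a sketch in assumed notation, not taken from the abstract: $\phi$ is a classifier, $Y \in \{1,\dots,K\}$ is the true label, $w_k \ge 0$ are risk weights, $\mathcal{A}$ is the set of classes whose errors are controlled, and $\alpha_k$ are the target error levels. The second display is a generic exponential tilting density ratio model with a hypothetical baseline density $f_1$ and parameters $(\gamma_k, \beta_k)$; the exact specification used in the paper may differ.

```latex
% NPMC problem (assumed notation): weighted risk subject to class-wise error bounds
\begin{align*}
\min_{\phi}\ & \sum_{k=1}^{K} w_k\, \mathbb{P}\bigl(\phi(X) \neq k \mid Y = k\bigr) \\
\text{s.t.}\ & \mathbb{P}\bigl(\phi(X) \neq k \mid Y = k\bigr) \le \alpha_k,
  \quad k \in \mathcal{A} \subseteq \{1,\dots,K\}.
\end{align*}

% Generic exponential tilting density ratio model (assumed form),
% where f_k is the density of X given Y = k
\[
\frac{f_k(x)}{f_1(x)} = \exp\bigl(\gamma_k + \beta_k^{\top} x\bigr),
\quad k = 2,\dots,K.
\]
```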