In supervised learning, automatically assessing the quality of the labels before any learning takes place remains an open research question. In certain cases, hypothesis testing procedures have been proposed to assess whether a given instance-label dataset is contaminated with class-conditional label noise, as opposed to uniform label noise. The existing theory builds on the asymptotic properties of the Maximum Likelihood Estimate for parametric logistic regression. However, the parametric assumptions on which these approaches rest are often too strong to be realistic in practice. To alleviate this problem, we propose an alternative path: we show how similar procedures can be followed when the underlying model is obtained by Local Maximum Likelihood Estimation, which yields more flexible nonparametric logistic regression models that are less susceptible to model misspecification. This view widens the applicability of the tests by giving users access to a richer model class. As in existing work, we assume access to user-provided anchor points. We introduce the ingredients needed to adapt the hypothesis tests to nonparametric logistic regression and compare empirically against the parametric approach, presenting both synthetic and real-world case studies and discussing the advantages and limitations of the proposed approach.