Confident Learning: Estimating Uncertainty in Dataset Labels

Learning exists in the context of data, yet notions of \emph{confidence} typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach that focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here we combine them, building on the assumption of a classification noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL that is provably consistent and experimentally performant. We present sufficient conditions under which CL exactly finds label errors, and show that CL outperforms seven state-of-the-art approaches for learning with noisy labels on the CIFAR dataset. The CL framework is \emph{not} coupled to a specific data modality or model: we use CL to find errors in the presumed error-free MNIST dataset and to improve sentiment classification on text data in Amazon Reviews. We also employ CL on ImageNet to quantify ontological class overlap (e.g., finding that approximately 645 \emph{missile} images are mislabeled as their parent class \emph{projectile}), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source \texttt{cleanlab} release.
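
The counting and ranking principles named in the abstract can be illustrated with a minimal sketch. It is not the \texttt{cleanlab} implementation, and the abstract does not specify the threshold rule: the sketch assumes each class threshold is the mean predicted probability of that class among examples given that label. The function name \texttt{confident\_joint}, its arguments, and the toy data are hypothetical, chosen only for illustration.

```python
def confident_joint(labels, pred_probs):
    """Count examples by (given label, confidently predicted label).

    labels:     list of noisy integer labels in range(K)
    pred_probs: list of length-K lists of predicted class probabilities,
                ideally computed out-of-sample (e.g. via cross-validation)
    """
    n_classes = len(pred_probs[0])

    # Assumed threshold rule: for class j, the mean predicted probability
    # of class j over the examples whose given label is j.
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = [sums[j] / counts[j] if counts[j] else 1.0
                  for j in range(n_classes)]

    joint = [[0] * n_classes for _ in range(n_classes)]
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability clears their own threshold.
        above = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not above:
            continue  # not confidently assigned to any class; not counted
        guess = max(above, key=lambda j: p[j])
        joint[y][guess] += 1
        if guess != y:
            # Off-diagonal count: candidate label error.
            suspects.append((p[y], i))

    # Rank candidates by probability of the given label, lowest first.
    suspects.sort()
    return joint, thresholds, [i for _, i in suspects]


# Toy data (made up): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
              [0.1, 0.9], [0.3, 0.7], [0.85, 0.15]]

joint, thresholds, suspects = confident_joint(labels, pred_probs)
print(joint)     # [[2, 1], [1, 2]]
print(suspects)  # [5, 2]
```

In this sketch the off-diagonal entries of the count matrix are the candidate label errors, and dividing the matrix by its total gives a rough, uncalibrated estimate of the joint distribution between given and unknown labels. Pruning would then drop some or all of the ranked candidates before retraining.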
