Large-scale real-world datasets inevitably contain label noise. Deep models gradually overfit noisy labels, which degrades generalization. Learning with noisy labels (LNL) methods are designed to mitigate this effect and achieve better generalization performance. For lack of suitable datasets, previous studies have frequently used synthetic label noise to mimic real-world label noise. However, synthetic noise is not instance-dependent, so this approximation is not always effective in practice. Recent research has proposed benchmarks for learning with real-world noisy labels, but their noise sources may be single or unclear, which makes these benchmarks differ from real-world data with heterogeneous label noise. To tackle these issues, we contribute NoisywikiHow, the largest NLP benchmark built with minimal supervision. Specifically, inspired by human cognition, we explicitly construct multiple sources of label noise to imitate human errors throughout the annotation process, replicating real-world noise whose corruption is affected by both ground-truth labels and instances. Moreover, we provide a variety of noise levels to support controlled experiments on noisy data, enabling systematic and comprehensive evaluation of LNL methods. We then conduct extensive multi-dimensional experiments on a broad range of LNL methods, obtaining new and intriguing findings.