Existing research on learning with noisy labels predominantly focuses on synthetic label noise. Although synthetic noise has well-defined structural properties, it often fails to replicate real-world noise patterns. In recent years, there has been a concerted effort to construct generalizable and controllable instance-dependent noise datasets for image classification, which has significantly advanced noise-robust learning in that area. However, studies on noisy label learning for text classification remain scarce. To better understand label noise in real-world text classification settings, we constructed the benchmark dataset NoisyAG-News through manual annotation. We first analyzed the annotated data to gather observations about real-world noise, and demonstrated both qualitatively and quantitatively that real-world noisy labels follow instance-dependent patterns. We then conducted comprehensive learning experiments on NoisyAG-News and its corresponding synthetic noise datasets using pre-trained language models and noise-handling techniques. Our findings reveal that while pre-trained models are resilient to synthetic noise, they struggle against instance-dependent noise, and samples of varying confusion levels show inconsistent performance during training and testing. These real-world noise patterns pose new and significant challenges, prompting a reevaluation of methods for handling noisy labels. We hope that NoisyAG-News will facilitate the development and evaluation of future solutions for learning with noisy labels.