An Empirical Study on the Effectiveness of Noisy Label Learning for Program Understanding

Deep learning models are now widely applied to program understanding tasks and achieve state-of-the-art results on many benchmark datasets. A major challenge is that the effectiveness of these approaches depends on the quality of their datasets, which often contain noisy samples. A typical kind of noise in program understanding datasets is label noise: the target outputs for some inputs are mislabeled. Because label noise can degrade the performance of deep learning models, researchers have proposed various approaches to mitigate its impact, forming a research topic known as noisy label learning (NLL). In this paper, we conduct an empirical study on the effectiveness of noisy label learning for deep learning on program understanding datasets. We evaluate various NLL approaches and deep learning models on two tasks: program classification and code summarization. The evaluation results show that label noise and NLL approaches affect small deep learning models and large pre-trained models differently: small models are vulnerable to label noise in program classification, and NLL approaches can improve their robustness, whereas large pre-trained models are robust against label noise, and NLL does not significantly improve their performance. On the other hand, NLL approaches achieve satisfactory results in identifying samples with noisy labels on both tasks, indicating that these techniques can help researchers build high-quality program understanding datasets.
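To make the notion of label noise concrete, the following minimal Python sketch injects synthetic label noise into a clean label list and scores how well a set of flagged samples matches the truly mislabeled ones. It is an illustration under stated assumptions, not the protocol used in the study: the uniform (symmetric) flipping scheme, the function names `inject_symmetric_noise` and `detection_scores`, and the toy four-class data are all hypothetical.

```python
import random


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip a fraction of labels to a different class chosen uniformly.

    Assumption: symmetric (uniform) noise; real datasets may contain
    other noise patterns. Returns the noisy labels and the flipped indices.
    """
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        # Pick any class except the original one, so the label really changes.
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy, flipped


def detection_scores(flagged, truly_noisy):
    """Precision and recall of flagged sample indices against the true noisy set."""
    flagged, truly_noisy = set(flagged), set(truly_noisy)
    true_positives = len(flagged & truly_noisy)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_noisy) if truly_noisy else 0.0
    return precision, recall


# Toy data (hypothetical): 100 samples spread evenly over 4 classes.
clean = [i % 4 for i in range(100)]
noisy, flipped = inject_symmetric_noise(clean, num_classes=4, noise_rate=0.2)
```

In an experiment of this kind, the flagged indices would come from an NLL approach applied to a model trained on `noisy`, and `detection_scores` would compare them with `flipped`.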
