Boosting Source Code Learning with Data Augmentation: An Empirical Study

The next era of program understanding is being propelled by the use of machine learning to solve software problems. Recent studies have shown remarkable results for source code learning, which applies deep neural networks (DNNs) to critical software tasks such as bug detection and clone detection. This success can be largely attributed to the availability of massive, high-quality training data. In practice, data augmentation, a technique for producing additional training data, has been widely adopted in other domains such as computer vision. In source code learning, however, data augmentation has not been extensively studied, and existing practice is limited to simple syntax-preserving methods such as code refactoring. When source code is used as training data, it is typically represented in one of two ways: sequentially, as text data, or structurally, as graph data. Inspired by these analogies, we take an early step to investigate whether data augmentation methods originally designed for text and graphs are effective in improving the training quality of source code learning. To that end, we first collect and categorize data augmentation methods from the literature. Second, we conduct a comprehensive empirical study on four critical tasks and 11 DNN architectures to explore the effectiveness of 12 data augmentation methods (code refactoring and 11 other methods for text and graph data). Our results identify the data augmentation methods that produce more accurate and robust models for source code learning, including those based on mixup (e.g., SenMixup for text and Manifold-Mixup for graphs) and those that slightly break the syntax of source code (e.g., random swap and random deletion for text).
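To make the text-style methods named in the abstract concrete, the following is a minimal sketch of random swap, random deletion, and a generic mixup interpolation applied to a tokenized code snippet. It is not the implementation used in the study: the function names, the whitespace tokenization, and the parameters `n_swaps`, `p`, and `lam` are illustrative assumptions. SenMixup and Manifold-Mixup operate on learned representations inside a model, whereas the `mixup` helper here only shows the underlying linear interpolation on plain vectors.

```python
import random
from typing import List, Sequence, Tuple


def random_swap(tokens: Sequence[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap n_swaps randomly chosen pairs of positions (may break code syntax)."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: Sequence[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token independently with probability p; keep at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(list(tokens))]


def mixup(
    x_a: Sequence[float],
    x_b: Sequence[float],
    y_a: Sequence[float],
    y_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Linearly interpolate two feature vectors and their (one-hot) label vectors."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    code_tokens = "int add ( int a , int b ) { return a + b ; }".split()
    print(random_swap(code_tokens, n_swaps=1, rng=rng))
    print(random_deletion(code_tokens, p=0.1, rng=rng))
    # Hypothetical 2-d embeddings and one-hot labels for two training samples.
    print(mixup([0.0, 2.0], [2.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.5))
```

Both token-level operations can yield code that no longer parses, which matches the abstract's description of methods that "slightly break the syntax of source code". In mixup-style training the coefficient `lam` is commonly sampled from a Beta distribution rather than fixed; it is passed explicitly here to keep the sketch deterministic.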
