Pre-trained code models lead the era of code intelligence, and many models with impressive performance have been proposed recently. However, one important problem, data augmentation for code data (which automatically helps developers prepare training data), remains understudied in the field of code learning. In this paper, we introduce a general data augmentation framework, GenCode, to enhance the training of code understanding models. GenCode follows a generation-and-selection paradigm to prepare useful training code: it first uses code transformation techniques to generate new code candidates, and then selects the important ones as training data according to importance metrics. To evaluate the effectiveness of GenCode with a general importance metric, the loss value, we conduct experiments on four code understanding tasks (e.g., code clone detection) and three pre-trained code models (e.g., CodeT5). Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% higher robustness on average.
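The generation-and-selection paradigm described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the transformations, the `dummy_loss` stand-in, and all function names are assumptions; a real setup would score each candidate with the training loss from a forward pass of the code model (e.g., CodeT5).

```python
# Hypothetical sketch of a generation-and-selection augmentation step:
# generate transformed code candidates, then keep those the model finds
# hardest (highest loss) as training data.
import random

def rename_variable(code: str) -> str:
    # Semantic-preserving transformation: rename a common identifier.
    return code.replace("x", "var_x")

def insert_dead_code(code: str) -> str:
    # Semantic-preserving transformation: append an unreachable statement.
    return code + "\nif False:\n    pass"

TRANSFORMS = [rename_variable, insert_dead_code]

def dummy_loss(code: str) -> float:
    # Stand-in importance metric. In practice this would be the
    # model's loss on the candidate, computed by a forward pass.
    rng = random.Random(hash(code) % (2**32))
    return rng.random()

def gencode_step(code: str, n_candidates: int = 8, top_k: int = 2) -> list[str]:
    """Generate candidates via random transformations, select top_k by loss."""
    candidates = [random.choice(TRANSFORMS)(code) for _ in range(n_candidates)]
    candidates.sort(key=dummy_loss, reverse=True)
    return candidates[:top_k]

selected = gencode_step("def add(x, y):\n    return x + y")
```

In this sketch, selecting candidates with the highest loss biases training toward samples the current model handles poorly, which is one common way to instantiate an importance metric.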