Pre-trained code models lead the era of code intelligence, with many models achieving impressive performance. However, an important problem, data augmentation for code data, which automatically helps developers prepare training data, remains under-studied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. In brief, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code augmentation techniques to generate new code candidates, and then identifies the important ones as training data by their influence scores. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% higher adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.