Pre-trained code models lead the era of code intelligence, with multiple models designed that achieve impressive performance. However, one important problem, data augmentation for code data, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce GenCode, a generic data augmentation framework to enhance the training of code understanding models. In brief, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it first employs code transformation techniques to generate new code candidates and then selects the important ones as training data according to importance metrics. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection) and three pre-trained code models (e.g., CodeT5). Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% higher robustness on average.
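The generation-and-selection paradigm can be illustrated with a minimal sketch. All names here (`rename_variables`, `gencode_step`) and the choice of importance metric (the model's training loss on a candidate, with higher loss treated as more informative) are assumptions for illustration, not the paper's actual API:

```python
# Hypothetical sketch of a generation-and-selection augmentation step.
# The transformation, metric, and function names are illustrative
# assumptions, not the GenCode implementation.

def rename_variables(code, mapping):
    """A simple semantic-preserving code transformation: variable renaming."""
    for old, new in mapping.items():
        code = code.replace(old, new)
    return code


def gencode_step(code, transforms, loss_fn, k=1):
    """One augmentation step: generate candidates, keep the top-k."""
    # 1) Generation: apply each code transformation to produce candidates,
    #    keeping the original code as a fallback candidate.
    candidates = [t(code) for t in transforms] + [code]
    # 2) Selection: rank candidates by an importance metric (here, the
    #    loss the current model assigns) and keep the k most important.
    ranked = sorted(candidates, key=loss_fn, reverse=True)
    return ranked[:k]
```

A training loop would call `gencode_step` on each example every epoch, so the selected variants track the model's current weaknesses rather than being fixed up front.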