Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision, and their performance is limited by the size of the available fine-tuning dataset. We address this limitation by proposing a simple data augmentation framework. The framework uses the knowledge a model gains during the pre-training and fine-tuning stages to generate pseudo data, which then serves as training data for the next fine-tuning step. We incorporate this framework into state-of-the-art PLMCs, namely CodeT5, CodeBERT, and UniXcoder. The results show that our framework significantly improves the performance of PLMCs on code-related sequence generation tasks from the CodeXGLUE benchmark, such as code summarization and code generation.
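The augmentation loop described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function name `augment_with_pseudo_data`, the `fine_tune` callable, the `rounds` parameter, and the choice to merge pseudo pairs with the original pairs are all assumptions made here, since the abstract does not specify these details.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (input sequence, target sequence)
Model = Callable[[str], str]      # maps an input sequence to a generated target


def augment_with_pseudo_data(
    train_pairs: List[Pair],
    fine_tune: Callable[[List[Pair]], Model],
    rounds: int = 1,
) -> Model:
    """Hypothetical sketch: fine-tune, generate pseudo targets, fine-tune again.

    `fine_tune` stands in for whatever routine fine-tunes a pre-trained model
    (e.g. CodeT5) on a list of pairs and returns a generation function.
    """
    # Step 1: ordinary fine-tuning on the labeled data.
    model = fine_tune(list(train_pairs))

    for _ in range(rounds):
        # Step 2: the fine-tuned model labels the training inputs itself.
        pseudo_pairs = [(src, model(src)) for src, _ in train_pairs]

        # Step 3 (assumption): train the next step on original + pseudo pairs.
        model = fine_tune(list(train_pairs) + pseudo_pairs)

    return model
```

In practice, `fine_tune` would wrap a full training run of the chosen PLMC, and `model(src)` would correspond to decoding (for example, beam search) from the fine-tuned checkpoint.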