Developers often perform repetitive code editing activities during software development for various reasons (e.g., code refactoring). Pre-trained code editing models have achieved state-of-the-art (SOTA) results. These models are first pre-trained on pre-training tasks and then fine-tuned on the code editing task. Existing pre-training tasks are mainly code infilling tasks (e.g., masked language modeling), which are derived from the natural language processing field and are not designed for automatic code editing. This paper proposes a novel pre-training task specialized for code editing and presents an effective pre-trained code editing model named CodeEditor. Our pre-training task further improves the performance and generalization ability of code editing models. Specifically, we collect a large number of real-world code snippets as the ground truth and use a powerful generator to rewrite them into mutated versions. We then pre-train CodeEditor to edit the mutated versions back into the corresponding ground truth, so that it learns edit patterns. We conduct experiments on four code editing datasets and evaluate the pre-trained CodeEditor in three settings. (1) In the fine-tuning setting, we train the pre-trained CodeEditor on the four datasets and evaluate it on the test data. CodeEditor outperforms the SOTA baselines by 15%, 25.5%, 9.4%, and 26.6% on the four datasets, respectively. (2) In the few-shot setting, we train the pre-trained CodeEditor with limited data and evaluate it on the test data. CodeEditor performs substantially better than all baselines. (3) In the zero-shot setting, CodeEditor correctly edits 1,113 programs, whereas the SOTA baselines cannot work in this setting.
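The data construction behind the pre-training task (ground-truth snippet, mutated version, edit back to the ground truth) can be illustrated with a minimal sketch. The abstract does not specify the generator, the tokenization, or how much of each snippet is mutated, so everything below is an assumption made for illustration: `toy_generator` is a hypothetical stand-in for the learned generator, whitespace splitting stands in for a real tokenizer, and `rate`, `mutate`, and `build_pairs` are invented names rather than part of CodeEditor.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[str]
Generator = Callable[[Tokens, int, random.Random], str]


def toy_generator(tokens: Tokens, position: int, rng: random.Random) -> str:
    """Hypothetical stand-in for the learned generator (an assumption).

    Proposes a replacement for tokens[position] by picking a different
    token that already occurs in the same snippet.
    """
    candidates = sorted(set(tokens) - {tokens[position]})
    return rng.choice(candidates) if candidates else tokens[position]


def mutate(tokens: Tokens, generator: Generator, rate: float,
           rng: random.Random) -> Tokens:
    """Rewrite a fraction of the tokens to obtain a mutated version."""
    if not tokens:
        return []
    count = min(len(tokens), max(1, round(rate * len(tokens))))
    mutated = list(tokens)
    for position in rng.sample(range(len(tokens)), count):
        # The generator always sees the original tokens, not earlier mutations.
        mutated[position] = generator(tokens, position, rng)
    return mutated


def build_pairs(snippets: List[str], generator: Generator = toy_generator,
                rate: float = 0.15, seed: int = 0) -> List[Tuple[str, str]]:
    """Return (mutated, ground_truth) pairs; an editing model would be
    trained to map the first element to the second."""
    rng = random.Random(seed)
    pairs = []
    for code in snippets:
        tokens = code.split()  # whitespace tokenization, an assumption
        pairs.append((" ".join(mutate(tokens, generator, rate, rng)), code))
    return pairs


if __name__ == "__main__":
    for source, target in build_pairs(["int sum = a + b ;"]):
        print("input :", source)
        print("target:", target)
```

Each pair keeps the real-world snippet as the target and uses its mutated version as the input, which mirrors the direction of editing described above. A learned generator would presumably produce more plausible mutations than this random token swap.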