Directional Diffusion-Style Code Editing Pre-training

Pre-trained code models have proven effective on a variety of software engineering tasks, many of which involve software evolution and/or code editing. However, existing pre-trained code models often overlook real-world code editing data and the evolutionary nature of the editing process. To simulate the step-by-step way human developers edit code, we propose DivoT5, a pre-trained model based on directional diffusion at the data level. DivoT5 uses two categories of pre-training tasks. The first category consists of mask-and-denoise tasks augmented with a diffusion direction that represents code evolution: we apply a noising process to the code snippet before evolution, and then train the model to restore the noisy snippet into the snippet after evolution. The second category consists of tasks that reinforce the evolutionary direction: for each pair of snippets before and after evolution, we generate various intermediate versions, and then train the model to transform each intermediate version into the snippet after evolution. We evaluate DivoT5 on two code-editing scenarios and one non-editing scenario using five downstream tasks, fine-tuning the pre-trained model on each task. Our experimental results show that DivoT5 achieves state-of-the-art (SOTA) performance on most tasks compared with models of the same scale (220M), larger (770M) models under fine-tuning, and billion-scale models (6.7B, 8B, ChatGPT) in few-shot settings. On one code-editing task (automated code review), DivoT5 pre-trained on top of CodeT5-small (60M) even outperforms CodeT5-base (220M) and all other pre-trained models with 220M parameters, except for DivoT5 pre-trained on top of CodeT5-base (220M).
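
To make the two categories of pre-training tasks more concrete, the following is a minimal sketch of how such training pairs could be built from one before/after snippet pair. It is an illustration under stated assumptions, not the DivoT5 implementation: the whitespace tokenization, the `mask_ratio` parameter, the use of `difflib` to derive edit operations, the left-to-right order in which edits are applied, and the function names `mask_tokens`, `intermediate_versions`, and `build_examples` are all hypothetical choices made here. The sentinel strings only mimic the T5-style `<extra_id_N>` convention.

```python
import difflib
import random


def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Noise the pre-evolution snippet by replacing random tokens with sentinels.

    Assumption: simple token-level masking; the paper's noising process may differ.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio)) if tokens else 0
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    noisy, sentinel_id = [], 0
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            noisy.append(f"<extra_id_{sentinel_id}>")
            sentinel_id += 1
        else:
            noisy.append(tok)
    return noisy


def intermediate_versions(old, new):
    """Return versions in which only the first k edit operations are applied.

    Assumption: edits are taken from difflib opcodes and applied left to right,
    with k ranging over 1 .. (number of edits - 1).
    """
    opcodes = difflib.SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    n_edits = sum(1 for tag, *_ in opcodes if tag != "equal")
    versions = []
    for k in range(1, n_edits):
        out, seen = [], 0
        for tag, i1, i2, j1, j2 in opcodes:
            if tag == "equal":
                out.extend(old[i1:i2])
            else:
                seen += 1
                # Already-applied edits take the new tokens, pending ones keep the old.
                out.extend(new[j1:j2] if seen <= k else old[i1:i2])
        versions.append(out)
    return versions


def build_examples(old, new, seed=0):
    """Build (source, target) pairs; the target is always the evolved snippet."""
    # Category 1: noisy pre-evolution snippet -> post-evolution snippet.
    examples = [(mask_tokens(old, seed=seed), new)]
    # Category 2: each intermediate version -> post-evolution snippet.
    examples += [(mid, new) for mid in intermediate_versions(old, new)]
    return examples


if __name__ == "__main__":
    before = "int x = 0 ; return x ;".split()
    after = "long x = 1 ; return x + 1 ;".split()
    for source, target in build_examples(before, after):
        print(" ".join(source), "=>", " ".join(target))
```

In this sketch every training pair shares the same target, the snippet after evolution, which is what gives the denoising objective its direction: the model is never asked to reconstruct the snippet before evolution.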
