Many Transformer-based pre-trained models for code have been developed and applied to code-related tasks. In this paper, we review the existing literature and examine the suitability of model architectures for different tasks, the generalization ability of models across datasets, and their resource consumption. We examine three representative pre-trained models for code (CodeBERT, CodeGPT, and CodeT5) and conduct experiments on the four software engineering tasks most frequently targeted in our literature survey: Code Summarization, Bug Fixing, Bug Detection, and Code Search. Our study showcases the capability of decoder-only models (CodeGPT) on specific generation tasks under state-of-the-art evaluation metrics and contests the common belief that the encoder-decoder architecture is optimal for general-purpose coding tasks. Additionally, we found that the most frequently used models are not necessarily the most suitable for certain applications, and that developers' needs are not adequately addressed by current research. We also found that, for both Bug Fixing and Code Summarization, neither the benchmark dataset nor the frequent dataset enables models to generalize to other datasets for the same task (the frequent dataset is the dataset used most often in the literature other than the benchmark). We use statistical testing to support the conclusions we draw from our experiments. Finally, we find that CodeBERT is highly efficient for understanding tasks, whereas CodeT5's efficiency for generation tasks is in doubt, as having the highest resource consumption does not guarantee consistently better performance across metrics. We also discuss a number of practical issues in advancing future research on Transformer-based models for code-related tasks.