Developers deal with code-change-related tasks, e.g., code review, on a daily basis. Pre-trained code models and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their effectiveness in code-related tasks. However, existing LLMs for code focus on general code syntax and semantics rather than on the differences between two code versions. Thus, how LLMs perform on code-change-related tasks remains an open question. To answer it, we conduct an empirical study with LLMs of more than 1B parameters on three code-change-related tasks, i.e., code review generation, commit message generation, and just-in-time comment update, using in-context learning (ICL) and parameter-efficient fine-tuning (PEFT, including LoRA and prefix-tuning). We observe that the performance of LLMs is poor without examples and generally improves with examples, but more examples do not always lead to better performance. LLMs tuned with LoRA achieve performance comparable to that of the state-of-the-art small pre-trained models. Larger models are not always better, but the \textsc{Llama~2} and \textsc{Code~Llama} families are consistently the best. The best LLMs outperform small pre-trained models on code changes that only modify comments and perform comparably on other code changes. We suggest that future work focus more on guiding LLMs to learn knowledge specific to changes in code rather than in comments for code-change-related tasks.