General-purpose large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in software engineering tasks such as code generation. This has led to the development of LLMs specialized for software engineering, known as Code LLMs. A considerable portion of Code LLMs are derived from general LLMs through fine-tuning. As a result, Code LLMs are updated frequently, and their performance can be influenced by their base LLMs. However, Code LLMs and their performance have not yet been investigated systematically. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and of how their performance differs from that of general LLMs. We aim to address three questions: (1) Which LLMs are specifically designed for software engineering tasks, and what are the relationships among these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in which software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs by publisher and examine their relationships with general LLMs and with one another. Furthermore, we investigate the performance differences between general LLMs and Code LLMs across various software engineering tasks to show the impact of base models and Code LLMs. Finally, we compile the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for the development of more advanced LLMs but also gives practitioners insight into key improvement directions for Code LLMs.