General large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in software engineering tasks such as code generation. This has led to the development of LLMs specialized for software engineering, known as Code LLMs. A considerable portion of Code LLMs are derived from general LLMs through fine-tuning. As a result, Code LLMs are updated frequently, and their performance can be influenced by their base LLMs. However, there is currently no systematic investigation of Code LLMs and their performance. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and of how their performance differs from that of general LLMs. We aim to address three questions: (1) Which LLMs are specifically designed for software engineering tasks, and what are the relationships among these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in different software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs by publisher and examine their relationships with general LLMs and with one another. Furthermore, we investigate the performance differences between general LLMs and Code LLMs across various software engineering tasks to demonstrate the impact of base models and Code LLMs. Finally, we compile the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for building more advanced LLMs, but also gives practitioners insight into the key directions for improving Code LLMs.