In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. In this work, we evaluate five state-of-the-art LLMs (Bard, BingChat, ChatGPT, Llama2, and Code Llama) with respect to their capabilities for text-to-code generation. In an empirical study, we feed the models prompts containing textual descriptions of coding problems sourced from the programming platform LeetCode and task them with creating solutions in Python. We then assess the quality of the generated outputs using LeetCode's testing functionality. The results reveal large differences in performance between the investigated models: ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models such as Code Llama. To gain further insights, we measure the runtime and memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, which compares differences in the indentation and form of the generated code and assigns incorrectly solved tasks to specific error categories, allows us to obtain a more nuanced picture of the results and of the potential for improvement. The results also show a clear pattern: the models produce increasingly incorrect code when facing a large amount of context in the form of longer prompts.