With the emergence of Large Language Models (LLMs), the programming capabilities of models have improved significantly, attracting growing attention from researchers. Evaluating these capabilities is crucial: it reflects the multifaceted abilities of LLMs and has numerous downstream applications. In this paper, we propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. The programming comprehension task tests LLMs on multiple-choice exam questions covering conceptual understanding, commonsense reasoning, and multi-hop reasoning. The code generation task evaluates LLMs on completing C++ functions based on provided descriptions and prototypes. The code correction task asks LLMs to fix real-world erroneous code segments with different error messages. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving accuracies of approximately 69%, 54%, and 66% on the three tasks, respectively. Compared to human performance, there is still significant room for improvement in LLM programming. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth.