CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, Yifan Liu, Jingkuan Wang, Siyuan Qi, Kangning Zhang, Weinan Zhang, Yong Yu

From arXiv, 33 pages

With the emergence of Large Language Models (LLMs), the programming capabilities of models have improved significantly, attracting growing attention from researchers. Evaluating the programming capabilities of LLMs is crucial because it reflects their multifaceted abilities and has numerous downstream applications. In this paper, we propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs. The programming comprehension task tests LLMs on multiple-choice exam questions covering conceptual understanding, commonsense reasoning, and multi-hop reasoning. The code generation task evaluates LLMs by having them complete C++ functions based on provided descriptions and prototypes. The code correction task asks LLMs to fix real-world erroneous code segments with different error messages. We evaluate 12 widely used LLMs, including both general-purpose and specialized models. GPT-4 exhibits the best programming capabilities, achieving approximate accuracies of 69%, 54%, and 66% on the three tasks, respectively. Compared to human performance, there is still significant room for improvement in LLM programming. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs, further promoting their development and growth.
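To make the reported accuracy figures concrete, the following is a minimal sketch of how accuracy on a multiple-choice comprehension task could be computed. It is an illustration only: the function name `multiple_choice_accuracy`, the letter-based answer format, and the sample data are assumptions, and the abstract does not specify the benchmark's actual scoring protocol.

```python
def multiple_choice_accuracy(predictions, answer_key):
    """Return the fraction of questions answered correctly.

    Hypothetical helper: answers are assumed to be option letters
    such as "A" or "b"; comparison ignores case and whitespace.
    """
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Made-up data: four questions, three answered correctly.
answer_key = ["A", "C", "B", "D"]
predictions = ["A", "C", "D", "D"]
print(f"accuracy = {multiple_choice_accuracy(predictions, answer_key):.0%}")
```

With the made-up data above, three of the four answers match, so the sketch reports 75%. Scoring the code generation and code correction tasks would instead require compiling and running the model's output against test cases, which this sketch does not attempt.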
