The emergence of Large Language Models (LLMs) has brought significant improvements in the programming capabilities of models and has attracted growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset that focuses on the programming comprehension and code generation abilities of LLMs. For programming comprehension, CodeApex comprises three types of multiple-choice questions: conceptual understanding, commonsense reasoning, and multi-hop reasoning. For code generation, CodeApex uses algorithmic questions and their corresponding test cases to assess the quality of the code generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving accuracies of approximately 50% and 56% on the two tasks, respectively. These results indicate that there is still significant room for improvement on programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs and further promote their development. The datasets are released at \url{https://github.com/APEXLAB/CodeApex.git}. The CodeApex submission website is \url{https://apex.sjtu.edu.cn/codeapex/}.
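As an illustration of how generated code can be scored against test cases, the following is a minimal sketch and not the CodeApex evaluation harness. The function `pass_rate`, the two stand-in "generated" functions, and the test cases are all hypothetical names and data introduced here for illustration. The sketch assumes each test case is a pair of input arguments and an expected return value.

```python
from typing import Any, Callable, Iterable, Tuple

# Hypothetical helper (not from CodeApex): fraction of test cases passed.
def pass_rate(candidate: Callable[..., Any],
              test_cases: Iterable[Tuple[tuple, Any]]) -> float:
    cases = list(test_cases)
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash on one test case counts as a failure for that case only.
            pass
    return passed / len(cases)

# Stand-ins for model-generated solutions (illustrative only).
def generated_add(a, b):
    return a + b

def generated_buggy_max(a, b):
    return a  # wrong whenever b > a

# Illustrative test cases: (input arguments, expected output).
cases_add = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
cases_max = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4), ((0, 9), 9)]

print(pass_rate(generated_add, cases_add))        # 1.0
print(pass_rate(generated_buggy_max, cases_max))  # 0.5
```

A real harness would typically run untrusted generated code in a sandboxed subprocess with time and memory limits. This sketch calls the function in-process only to keep the example short.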