With the emergence of Large Language Models (LLMs), the programming capabilities of models have improved significantly, attracting growing attention from researchers. We propose CodeApex, a bilingual benchmark dataset focusing on the programming comprehension and code generation abilities of LLMs. CodeApex comprises three types of multiple-choice questions (conceptual understanding, commonsense reasoning, and multi-hop reasoning) designed to evaluate LLMs on programming comprehension tasks. Additionally, CodeApex uses algorithmic questions and corresponding test cases to assess the quality of code generated by LLMs. We evaluate 14 state-of-the-art LLMs, including both general-purpose and specialized models. GPT exhibits the best programming capabilities, achieving accuracies of approximately 50% and 56% on the two tasks, respectively. These results leave significant room for improvement on programming tasks. We hope that CodeApex can serve as a reference for evaluating the coding capabilities of LLMs and further promote their development. The datasets are released at https://github.com/APEXLAB/CodeApex.git. The CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.