CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Source: arXiv. Accepted at the International Joint Conference on Natural Language Processing & Asia-Pacific Chapter of the Association for Computational Linguistics, 2025. To be published in the ACL Anthology.

Large Language Models (LLMs) are predominantly assessed on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated remarkable advancements in logical reasoning, there remains a significant gap in evaluating their code generation capabilities. Existing benchmark datasets fall short in pinpointing specific strengths and weaknesses, which impedes targeted improvements in the reasoning that models use to synthesize code. To bridge this gap, our paper introduces a pedagogical benchmarking method that mirrors the evaluation processes encountered in academic programming courses. We introduce CodeEval, a multi-dimensional benchmark dataset designed to rigorously evaluate LLMs across 24 distinct aspects of Python programming. The dataset covers three proficiency levels (beginner, intermediate, and advanced) and includes both class-based and function-based problem types with detailed problem specifications and comprehensive test suites. To facilitate widespread adoption, we also developed RunCodeEval, an open-source execution framework that provides researchers with a ready-to-use evaluation pipeline for CodeEval. RunCodeEval handles test execution, context setup, and metrics generation, enabling researchers to quickly obtain detailed insights into model strengths and weaknesses across complexity levels, problem types, and programming categories. This combination enables targeted evaluation and guides improvements in LLMs' programming proficiencies.
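To make the idea of per-dimension metrics concrete, here is a minimal sketch of how pass rates could be broken down by complexity level, problem type, and category. It is not RunCodeEval's actual API or schema: the `Result` record, the `pass_rate_by` helper, the category labels, and the sample data are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One evaluated problem (hypothetical layout, not RunCodeEval's schema)."""
    level: str         # "beginner", "intermediate", or "advanced"
    problem_type: str  # "function" or "class"
    category: str      # illustrative category label
    passed: bool       # True if the generated code passed every test


def pass_rate_by(results, key):
    """Return {group: fraction of problems passed}, grouped by attribute `key`."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        group = getattr(r, key)
        totals[group] += 1
        passes[group] += int(r.passed)
    return {group: passes[group] / totals[group] for group in totals}


# Made-up results, purely for illustration.
results = [
    Result("beginner", "function", "strings", True),
    Result("beginner", "class", "oop", True),
    Result("intermediate", "function", "recursion", False),
    Result("intermediate", "function", "strings", True),
    Result("advanced", "class", "oop", False),
]

for dimension in ("level", "problem_type", "category"):
    print(dimension, pass_rate_by(results, dimension))
```

Grouping the same results along several dimensions is what lets a single evaluation run show, for example, that a model handles beginner problems well but struggles with class-based or advanced ones.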
