Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks, potentially inflating estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts with a balanced representation of 38 programming concepts across diverse difficulty levels.