Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, with most other concepts neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts that provide a balanced representation of 38 programming concepts across diverse difficulty levels.