As code generation from human-written descriptions using large language models (LLMs) has grown in popularity, several benchmarks have been proposed to assess the capabilities of existing and emerging models. This study presents a large-scale human evaluation of HumanEval and MBPP, two widely used benchmarks for Python code generation, focusing on their diversity and difficulty. Our findings reveal a significant bias towards a small number of programming concepts, with most other concepts receiving negligible or no representation. We also find a worryingly high proportion of easy programming questions, which may lead to an overestimation of model performance on code generation tasks.