Recent advancements in large language models (LLMs) have greatly improved code generation, particularly at the function level. For instance, GPT-4o has achieved a 91.0\% pass rate on HumanEval. However, this calls into question whether existing benchmarks adequately assess function-level code generation capabilities. Our study analyzed two common benchmarks, HumanEval and MBPP, and found that they may not thoroughly evaluate LLMs' code generation capabilities due to limitations in quality, difficulty, and granularity. To address this, we introduce the Mostly Hard Python Problems (MHPP) dataset, consisting of 210 unique human-curated problems. By focusing on the combination of natural language and code reasoning, MHPP gauges LLMs' abilities to comprehend specifications and restrictions, engage in multi-step reasoning, and apply coding knowledge effectively. Initial evaluations of 26 LLMs using MHPP showed that many models with high performance on HumanEval failed to achieve similar success on MHPP. Moreover, MHPP revealed previously undiscovered limitations across a range of LLMs, leading us to believe that it could pave the way for a better understanding of LLMs' capabilities and limitations. MHPP, the evaluation pipeline, and the leaderboard are available at https://github.com/SparksofAGI/MHPP.