Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems in HumanEval and MBPP. Thus, a natural question arises: do LLMs achieve code completion performance comparable to that of human developers? Unfortunately, one cannot answer this question using existing manually crafted or simple (e.g., single-line) code generation benchmarks, since such tasks fail to represent real-world software development. In addition, existing benchmarks often use poor code correctness metrics, which can lead to misleading conclusions. To address these challenges, we create REPOCOD, a code generation benchmark of 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. Moreover, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes an average of 313.5 developer-written test cases for more reliable correctness evaluation. In our evaluation of ten LLMs, none achieves more than 30% pass@1 on REPOCOD, indicating the need for stronger LLMs that can help developers in real-world software development. REPOCOD is available at https://github.com/lt-asset/REPOCOD.
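For concreteness, the pass@k numbers quoted above are conventionally computed with the unbiased estimator of Chen et al. (2021); the sketch below illustrates that standard formula in Python. Whether REPOCOD's harness uses exactly this estimator is an assumption on our part, as the abstract does not spell it out.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of samples generated per task
    c: number of those samples that pass all of the task's tests
    k: the k in pass@k (k=1 for the numbers in this abstract)

    Assumption: REPOCOD reports this standard estimator; the
    abstract itself does not state the exact formula.
    """
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    # 1 - C(n - c, k) / C(n, k): probability that a random size-k
    # subset of the n samples contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 3 of 10 samples pass all tests -> pass@1 = 0.3
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

With k=1 this reduces to c/n, the fraction of samples that pass all developer-written tests, which is why pass@1 is often read directly as per-sample accuracy.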