To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is easily executable or has human-written tests. To facilitate the evaluation of code generation systems across diverse scenarios, we present CodeBenchGen, a framework to create scalable execution-based benchmarks from naturally occurring code sources. Specifically, we leverage a large language model (LLM) to sandbox arbitrary pieces of code into evaluation examples, including test cases for execution-based evaluation. We illustrate the usefulness of our framework by creating a dataset, Exec-CSN, which includes 1,931 examples involving 293 libraries converted from code in 367 GitHub repositories taken from the CodeSearchNet dataset. To demonstrate the solvability of the examples in Exec-CSN, we present a human study showing that 81.3% of the examples can be solved by humans and 61% are rated as "requires effort to solve". We conduct code generation experiments on open-source and proprietary models and analyze the performance of both humans and models. We provide code and data at: https://github.com/yiqingxyq/CodeBenchGen.